require(plotly)
require(hierfstat)
require(adegenet)
require(PopGenReport)
require(pheatmap)
require(tidyverse)
require(DiagrammeR)
require(poppr)
require(genepop)
require(graph4lg)
require(knitr)
require(pegas)
require(cowplot)

Readme

This is an rstudio project. If you’d like to pre-rendered figures, read a summary of analysis and view code, please open the html file in a browser.

To conduct the analyses on your computer, edit or run code: clone this repository into a directory on you r local machine and open the .Rproj file in Rstudio. All data and analyses are available in the github repository at https://github.com/david-dayan/rogue_half_pounder.git

Rationale

In addition to winter early-summer and late-summer runs, some steelhead on the Rogue express a “half-pounder” life-history. Half-pounders exhibit a false-spawning run (although some precocious males may spawn) as juveniles during the summer months, then reutrn to sea to continue the marine growth phase. There is great interest in half-pounders from a management perspective, because (i) half pounder abundance should integrate juvenile freshwater and early ocean conditions for steelhead regardless of whether they express the half pounder phenotype (ii) half pounder abundance is predictive of steelhead abundance in historical dam passage counts, and (iii) half-pounders are a unique fishery of great cultural and economic value. However, the relative proportion of winter vs early summer vs late-summer run life histories expressed by half pounders later in life is unknown. Furthermore there is limited genetic information about the three adult runs of steelhead on the Rogue.

This study attempts to use neutral and adaptive GTseq markers to examine population structure and run timing among Rogue River steelhead. We also attempt to classify half-pounders into run-timing groups on the basis of run-timing associated genetic markers.

Data Summary

Samples

Half Pounders
Samples were collected during ODFW’s Lower Rogue Seining Project in 2018 and 2019. The Lower Rogue Seining Project estimates escapement for Coho, late-summer run O. mykiss, half-pounder O. mykiss and fall chinook by beach seining near Huntley Park at approximate river mile 8, three times weekly from July through October. Half-pounder O. mykiss were identified as individuals with fork length 250 - 410mm and sampled in batches of up to 50 fish each day for 11 days from September 7th to October 1st 2018 totaling 384 individuals and 18 days from August 14th to September 25th 2019 totaling 331 individuals. Caudal fin clips were taken for DNA extraction, and placed in daily batch vials containing 95% ethanol. Note that due to batch collection of fin clips, that these numbers are inflated (some individuals have multiple tissue samples - identified later and filtered)

All half-pounders are non-marked and assumed to be natural origin fish.

Also ~5% of samples represented twice in GTseq library as QAQC samples

Adults
For brevity and easy code comprehension, throughout the notebook early-summer run steelhead are referred to as summer run and late-summer run steelhead are referred to as fall run.

45 winter and 45 early-run summer run fish. Adult summer steelhead were sampled at the Cole River Hatchery sorting pond on 6/26/2019 and (b) adult winter steelhead were sampled at Cole Rivers Hatchery (Rogue River) and the Applegate River from adult brood stock for 2019. Finally 166 additional late-summer run fish (fall run) were sampled at the Huntley Park seine on the lower Rogue.

All but 4 early-summer run adults are marked and assumed to hatchery origin fish. The final filtered dataset contained 1 natural origin and 41 hatchery origin fish. Winter run fish include NOR and HOR fish. After filtering the final dataset contains 18 HOR and 22 NOR fish. Early summer run fish are all unmarked and assumed NOR. All late-summer fish were unmarked and assumed NOR.

Genotype Data

Information about sequencing data, genotype calling and filtering is available in the relevant R notebook (half_pounder_genotyping_notebook.Rmd) in this repository.

load("./genotype_data/genind_2.0.R")
load("./genotype_data/genotypes_2.2.R")
genos_2.2 <- ungroup(genos_2.2)
kable(genos_2.2 %>%
  group_by(run, year) %>%
  tally(), caption = "Final Filtered Dataset")
Final Filtered Dataset
run year n
fall 2019 157
halfpounder 2018 338
halfpounder 2019 305
Summer 2019 42
Winter 2019 40
#save a file with this info
progeny <- readxl::read_xlsx("../meta_data/2019_summer/Omy Rogue2019 steelhead datasheets.xlsx", sheet = 4) 
progeny <- progeny %>%
  select(SFGL_ID, Origin)
select(genos_2.2, Sample, run, year, Date) %>%
  left_join(progeny, by = c("Sample" = "SFGL_ID")) %>%
  mutate(Origin = replace(Origin, Origin == "AD", "HOR")) %>%
  mutate(Origin = replace(Origin, Origin == "1", "NOR")) %>%
  replace_na(list(Origin = "NOR")) %>%
  write_tsv("sample_metadata.txt")#convert AD (HOR) and 1 (NOR)

Marker info

Throughout the notebook we refer to subsets of markers based on annotations provided by CRITFC. To my knowledge these annotations come from decisions made when the panel was assembled and edited (i.e. a run timing associated marker came out of a GWAS fro run timing). Broadly we examine 4 subsets of markers:

  1. All markers: all 390 GTseq markers available in the full SFGL O mykiss GTseq panel. (The dataset here has fewer markers due to filtering - see genotyping notebook) (355 markers)
  2. Neutral Markers: Any markers annotated as “neutral,” with no other annotations. From my understanding these were mostly selected from broad spatial scale RADseq studies that refelct broad scale geographic structure and may or may not be in linkage with unkownn adaptive loci. (226 markers)
  3. Adaptive Markers: any markers with a known adaptive functional annotation. (129 markers)
  4. Run-timing markers: a subset of adaptive markers with specific associations with run-timing in O. mykiss. (14 markers)

The table below demonstrates the annotations

marker_mapping <- readxl::read_xlsx("./metadata/final_mapping_results.xlsx", sheet = 1)

marker_summary <- marker_mapping %>%
  mutate(marker = str_replace(marker, "Omy(\\d+)", "Chr\\1")) %>%
  filter(marker %in% colnames(genos_2.2)) %>%
  mutate(neutral = if_else(str_detect(`Presumed Type`, 'neutral|Neutral'), "neutral", "adaptive")) %>%
  mutate(run_timing = if_else(str_detect(`Presumed Type`, 'run|Run'), "run_timing", "non-run_timing")) %>%
  dplyr::select(marker, chr, start, 'Presumed Type', neutral, run_timing, Source)

DT::datatable(marker_summary, options = list(pageLength=10))

Workflow

Our primary goals:
(a) examine population structure within and among the three run timing groups in the Rogue River (early summer, late summer and winter runs) (late summer referred to as fall in notebook)
(b) classify half-pounders into run timing group.

To begin, we will collect nucleotide diversity index at the marker and population levels, examine population structure at neutral and adaptive markers using multivariate and bayesian approaches, and assign half pounders to run timing groups based on a genetic classifier.

Diversity Metrics

Here we calculate both per-marker and per-population diversity metrics (Ho He HWE Fis and Fst)

Heterozygosity

Heterogyzity metrics

# first Ho and He
n.pop <- seppop(genind_2.0)

hobs <- lapply(n.pop, function(x) (summary(x)$Hobs))
hobs <- as.data.frame(t(do.call(rbind, hobs)))
hobs <- hobs %>%
  rownames_to_column(var="marker")
hobs <- hobs %>%
  pivot_longer(-marker, names_to = "run", values_to = "Ho")
#ggplot(hobs)+geom_boxplot(aes(x=pop, y=hobs))+theme_classic()+xlab("Run Timing")+ylab("Observed Heterozygosity")

hexp <- lapply(n.pop, function(x) (summary(x)$Hexp))
hexp <- as.data.frame(t(do.call(rbind, hexp)))
hexp <- hexp %>%
  rownames_to_column(var="marker") %>%
  pivot_longer(-marker, names_to = "run", values_to = "He")

marker_divs <- hexp %>%
  left_join(hobs)

#now lets add the annotation status (neutral vs adaptive)

marker_mapping <- readxl::read_xlsx("./metadata/final_mapping_results.xlsx", sheet = 1)

marker_divs <- marker_mapping %>%
  select(marker, `Presumed Type`) %>%
  mutate(marker = str_replace(marker, "Omy(\\d+)", "Chr\\1")) %>% #marker name convention is different
  mutate(neutral = if_else(str_detect(`Presumed Type`, 'Adaptive|adaptive'), "adaptive", "neutral")) %>%
  right_join(marker_divs)

# lets add known run-timing marker as a grouping variable, and conver to longer format
marker_divs <-  marker_divs %>%
  mutate(run_timing_marker = if_else(str_detect(`Presumed Type`, 'Run|run'), "run" , "non-run")) %>%
   pivot_longer(c("Ho", "He"), names_to = "obs_exp", values_to ="H")

#nice, now lets make a long table


# lets throw up some nice plots here
# first for all markers
ggplot(data=marker_divs)+geom_boxplot(aes(x=run, y=H, fill = obs_exp))+scale_fill_viridis_d()+theme_classic()+ggtitle("all markers")

ggplot(data=marker_divs[marker_divs$neutral=="neutral",])+geom_boxplot(aes(x=run, y=H, fill = obs_exp))+scale_fill_viridis_d()+theme_classic()+ggtitle("neutral markers")

ggplot(data=marker_divs[marker_divs$neutral=="adaptive",])+geom_boxplot(aes(x=run, y=H, fill = obs_exp))+scale_fill_viridis_d()+theme_classic()+ggtitle("adaptive markers")

ggplot(data=marker_divs[marker_divs$run_timing_marker=="run",])+geom_boxplot(aes(x=run, y=H, fill = obs_exp))+scale_fill_viridis_d()+theme_classic()+ggtitle("run-timing markers")

#publication plot fig. 1
#ggplot(data=drop_na(marker_divs[marker_divs$run_timing_marker=="run",]))+geom_boxplot(aes(x=run, y=H, fill = obs_exp))+scale_fill_viridis_d(labels = c(expression(H[E]), expression(H[O])), name = " ")+theme_classic()+xlab("")+scale_x_discrete(labels = c("Late-Summer", "Half-Pounder", "Early-Summer", "Winter"))

#publication plot suppl fig. 1
# a <- ggplot(data=drop_na(marker_divs[marker_divs$neutral=="neutral",]))+geom_boxplot(aes(x=run, y=H, fill = obs_exp))+scale_fill_viridis_d(labels = c(expression(H[E]), expression(H[O])), name = " ")+theme_classic()+xlab("")+scale_x_discrete(labels = c("Late-Summer", "Half-Pounder", "Early-Summer", "Winter"))+theme(legend.position = "none")
# 
# b <- ggplot(data=drop_na(marker_divs[marker_divs$neutral=="adaptive",]))+geom_boxplot(aes(x=run, y=H, fill = obs_exp))+scale_fill_viridis_d(labels = c(expression(H[E]), expression(H[O])), name = " ")+theme_classic()+xlab("")+scale_x_discrete(labels = c("Late-Summer", "Half-Pounder", "Early-Summer", "Winter"))+ylim(0,0.68)
# 
# plot_grid(a,b, rel_widths = c(0.83,1), labels = c("(a)", "(b)"))

#supplemental table
marker_divs %>%
  group_by(neutral, run, obs_exp) %>%
  summarise(mean = mean(H)) %>%
  print(n = 24)
## # A tibble: 24 x 4
## # Groups:   neutral, run [12]
##    neutral  run         obs_exp  mean
##    <chr>    <chr>       <chr>   <dbl>
##  1 adaptive fall        He      0.254
##  2 adaptive fall        Ho      0.245
##  3 adaptive halfpounder He      0.259
##  4 adaptive halfpounder Ho      0.235
##  5 adaptive Summer      He      0.215
##  6 adaptive Summer      Ho      0.219
##  7 adaptive Winter      He      0.226
##  8 adaptive Winter      Ho      0.221
##  9 neutral  fall        He      0.308
## 10 neutral  fall        Ho      0.302
## 11 neutral  halfpounder He      0.310
## 12 neutral  halfpounder Ho      0.300
## 13 neutral  Summer      He      0.304
## 14 neutral  Summer      Ho      0.297
## 15 neutral  Winter      He      0.311
## 16 neutral  Winter      Ho      0.308
## 17 <NA>     fall        He      0.333
## 18 <NA>     fall        Ho      0.350
## 19 <NA>     halfpounder He      0.334
## 20 <NA>     halfpounder Ho      0.329
## 21 <NA>     Summer      He      0.350
## 22 <NA>     Summer      Ho      0.402
## 23 <NA>     Winter      He      0.358
## 24 <NA>     Winter      Ho      0.321

Significance Testing

Are these differences between populations significant? What about differences between nuetral and adaptive sets of loci. To test with the same number of loci (across populations) we’ll use a monte-carlo test and 999 permutations. To test across different sets of loci (neutral vs adaptive), we’ll use a simple t-test

#lets make a way to easily called different snp classes 
#"neutral" markers 

neutral_loci_names <- marker_mapping %>%
  filter(str_detect(`Presumed Type`, 'neutral|Neutral')) %>%
  dplyr::select(marker)

#different naming convention, lets fix
neutral_loci_names <- str_replace(neutral_loci_names$marker, "Omy28", "Chr28")

#run timing markers
run_timing_loci_names <- marker_mapping %>%
  filter(chr == "28" | CRITFC_chromosome == "28") %>%
  filter(str_detect(`Presumed Type`, 'run|Run')) %>%
  select(marker)

#different naming convention, lets fix
run_timing_loci_names <- str_replace(run_timing_loci_names$marker, "Omy28", "Chr28")

## test for all pairwise differences at run-timing markers
#make a pop separted genind at run timing markers
chr28_seppop  <- seppop(genind_2.0[loc=run_timing_loci_names])


# fall-half
a <- Hs.test(chr28_seppop$fall, chr28_seppop$halfpounder)

#fall-summer
b <- Hs.test(chr28_seppop$fall, chr28_seppop$Summer)

#fall-winter
c <- Hs.test(chr28_seppop$fall, chr28_seppop$Winter)

#half-summer
d <- Hs.test(chr28_seppop$halfpounder, chr28_seppop$Summer)

#half-winter
e <- Hs.test(chr28_seppop$halfpounder, chr28_seppop$Winter)

#summer-winter
f <- Hs.test(chr28_seppop$Summer, chr28_seppop$Winter)

#now make a nice table
df <- as.data.frame(cbind(c("fall", "fall", "fall", "half", "half", "summer"), c("half", "summer", "winter", "summer", "winter", "winter"), c(a$pvalue,b$pvalue,c$pvalue,d$pvalue,e$pvalue,f$pvalue)))
colnames(df) <- c("pop1", "pop2","p-value")
kable(df, caption = "Monte-Carlo p-value for difference in observed Heterozygosity at run-timing markers")
Monte-Carlo p-value for difference in observed Heterozygosity at run-timing markers
pop1 pop2 p-value
fall half 0.942
fall summer 0.001
fall winter 0.001
half summer 0.001
half winter 0.001
summer winter 0.001
## Next lets look if there is increased/decreased diversity within a population at these markers compared to neutral
neutral_seppop  <- seppop(genind_2.0[loc=neutral_loci_names])


#fall
a <- t.test(summary(chr28_seppop$fall)$Hexp, summary(neutral_seppop$fall)$Hexp, alternative = "greater")

#half
b <- t.test(summary(chr28_seppop$halfpounder)$Hexp, summary(neutral_seppop$halfpounder)$Hexp, alternative = "less")


#summer
c <- t.test(summary(chr28_seppop$Summer)$Hexp, summary(neutral_seppop$Summer)$Hexp, alternative = "less")

#winter
d <- t.test(summary(chr28_seppop$Winter)$Hexp, summary(neutral_seppop$Winter)$Hexp, , alternative = "less")

#now make a nice table
df <- as.data.frame(cbind(c("fall", "half", "summer", "winter"),c(a$p.value,b$p.value,c$p.value,d$p.value)))
colnames(df) <- c("pop", "p-val")
kable(df, caption = "T-test p-value for difference He at run timing markers vs neutral markers")
T-test p-value for difference He at run timing markers vs neutral markers
pop p-val
fall 0.104853452489372
half 0.90187719688705
summer 9.52372472690406e-38
winter 1.28798220584235e-05

All pairwise population comparisons of He at chr28 are significantly different (p=0.001) EXCEPT fall-halfpounder. Winter and Summer populations demonstrate significantly reduced He at chr28 markers relative to neutral markers.

Are there significant departures from HWE at the loci level?

# here we use the hw.test function from pegas (exact test based on Monte Carlo permutations of alleles, 1000 permutations)
HWE.test <- lapply(n.pop, function(x) hw.test(x, B=1000))
# here we take the list of dataframes of p-values and combine into a single dataframe
hwe <- reduce(HWE.test, cbind)
hwe <- hwe[,c(4,8,12,16)]
colnames(hwe) <- c("fall", "halfpounder", "summer", "winter")
hwe <- as.data.frame(hwe)

# next we correct for multiple comparisons
p.adj <- as.data.frame(apply(hwe, MARGIN = 2, function(x) p.adjust(x, "fdr")))
hwe_exceed <- p.adj %>% rownames_to_column(var="marker") %>%
  pivot_longer(-marker, names_to = "run", values_to = "fdr") %>%
  filter(fdr < 0.1)

a <- hwe_exceed%>%
  group_by(run) %>%
  tally()

kable(a, caption = "Number of markers significantly out of HWE")
Number of markers significantly out of HWE
run n
fall 17
halfpounder 73
summer 1
winter 1

Yes, see table above for number of SNPs outside of HWE (fdr < 0.1) per “population”

#lets check if there is an enrichment of run-timing and adaptive markers in markers out of HWE in fall and half-pounders
hwe_exceed <- hwe_exceed %>%
  left_join(pivot_wider(marker_divs, names_from = obs_exp, values_from = H))

#lets get marker info, but only for markers in the panel
marker_mapping2 <- marker_mapping %>%
  mutate(marker = str_replace(marker, "Omy(\\d+)", "Chr\\1")) %>%
  filter(marker %in% colnames(genos_2.2))

# halfpunder enriched for run timing markers
a <- nrow(hwe_exceed[hwe_exceed$run=="halfpounder" & hwe_exceed$run_timing_marker=="run",])
c <- nrow(marker_mapping2)-a
b <- nrow(hwe_exceed[hwe_exceed$run=="halfpounder" & hwe_exceed$run_timing_marker!="run",])
d <- nrow(marker_mapping2)-b

hr <- fisher.test(matrix(c(a,b,c,d), nrow=2), alternative = "less")

#fall for run-timing markers
a <- nrow(hwe_exceed[hwe_exceed$run=="fall" & hwe_exceed$run_timing_marker=="run",])
c <- nrow(marker_mapping2)-a
b <- nrow(hwe_exceed[hwe_exceed$run=="fall" & hwe_exceed$run_timing_marker!="run",])
d <- nrow(marker_mapping2)-b

fr <- fisher.test(matrix(c(a,b,c,d), nrow=2), alternative = "less")

#halfpounder for adaptive markers
a <- nrow(hwe_exceed[hwe_exceed$run=="halfpounder" & hwe_exceed$neutral=="adaptive",])
c <- nrow(marker_mapping2)-a
b <- nrow(hwe_exceed[hwe_exceed$run=="halfpounder" & hwe_exceed$neutral!="adaptive",])
d <- nrow(marker_mapping2)-b

ha <- fisher.test(matrix(c(a,b,c,d), nrow=2), alternative = "less")

#fall for adaptive markers
a <- nrow(hwe_exceed[hwe_exceed$run=="fall" & hwe_exceed$neutral=="adaptive",])
c <- nrow(marker_mapping2)-a
b <- nrow(hwe_exceed[hwe_exceed$run=="fall" & hwe_exceed$neutral!="adaptive",])
d <- nrow(marker_mapping2)-b

fa <- fisher.test(matrix(c(a,b,c,d), nrow=2), alternative = "less")

#make a nice table
df <- as.data.frame(cbind(c("fall", "fall", "half", "half"),c("run", "all adaptive", "run", "all adaptive"),c(fr$p.value, fa$p.value, hr$p.value, ha$p.value), c(fr$estimate, fa$estimate, hr$estimate, ha$estimate)))
colnames(df) <- c("pop", "marker type", "p-val", "fold-enrichment")
kable(df, caption = "enrichment (fisher's exact test) of marker types among markers out of HWE")
enrichment (fisher’s exact test) of marker types among markers out of HWE
pop marker type p-val fold-enrichment
odds.ratio fall run 0.00104773851728624 0.128712763712802
odds.ratio.1 fall all adaptive 0.836808251720311 1.4402519484201
odds.ratio.2 half run 2.04720132834574e-10 0.165683797102741
odds.ratio.3 half all adaptive 0.45116739827645 0.941481291402832

Markers significantly out of HWE (Ho < He: excess of homozygotes) are enriched for run-timing markers, but not adaptive markers in the fall and half-pounder groups.

Now let’s make a nice visual representation of this information (fall and half pounder demonstrate a dearth of heterozygotes at run ting markers)

plot_data_half <- marker_divs %>%
  filter(run == "halfpounder") %>%
  pivot_wider( names_from = obs_exp, values_from = H)

ggplot(plot_data_half)+geom_point(aes(He, Ho, color = run_timing_marker))+geom_abline(aes(intercept=0, slope=1))+coord_equal(ratio=1)+ylim(0,0.6)+scale_color_viridis_d()+theme_classic()+ggtitle("halfpounder")

plot_data_fall <- marker_divs %>%
  filter(run == "fall") %>%
  pivot_wider( names_from = obs_exp, values_from = H)

ggplot(plot_data_fall)+geom_point(aes(He, Ho, color = run_timing_marker))+geom_abline(aes(intercept=0, slope=1))+coord_equal(ratio=1)+ylim(0,0.6)+scale_color_viridis_d()+theme_classic()+ggtitle("fall")

It’s surprising that there is enrichment of run timing markers among markers out of HWE in the fall run from these figures, but double checked the code, still significant enriched… Meanwhile the results in halfpounders are very clear.

Heterozygosity results summary

At neutral markers, there is a slight reduction of observed from expected heterozygosity, most pronounced among summer run fish. At run timing markers, there is a significant reduction in diversity (Ho) relative to neutral markers among the summer run and winter run fish (one-sided t-test), and increased diversity at fall run and half-pounders (not significant). Furthermore, there is an excess of homozygosity at run timing associated SNPs in the half-pounders (but not fall fish) suggesting some structure within this group. Among markers with significant departures from HWE within fall and half-pounders, there is a significant enrichment of run-timing associated markers (p = p-value = 1.707e-10 and p = p-value = 0.02129, odds ratios 6.513 and 5.63, respectively fisher’s exact test), but not significant enrichment for markers annotated as adaptive overall.

Fst

Let’s move on to f-statistics. We’ll use hierfstats for most of this work, so the first step is to convert to a hierfstat format. Then we’ll calculate some basic statists and Fst (Nei)

fstat <- genind2hierfstat(genind_2.0)
colnames(fstat) <- c(pop, names(genind_2.0$loc.n.all))
# and also lets make datasets at different sets of loci, because hierfstat doesn't easily retain loci names for later use
fstat_neutral <- genind2hierfstat(genind_2.0[loc=neutral_loci_names])
colnames(fstat_neutral) <- c(pop, names(genind_2.0[loc=neutral_loci_names]$loc.n.all))
fstat_run_timing <- genind2hierfstat(genind_2.0[loc=run_timing_loci_names])
colnames(fstat_run_timing) <- c(pop, names(genind_2.0[loc=run_timing_loci_names]$loc.n.all))

#calculate datset wide basic stats
basicstats <- basic.stats(fstat)
basicstats_neutral <- basic.stats(fstat_neutral)
basicstats_run_timing <- basic.stats(fstat_run_timing)

kable(basicstats$overall, caption = "All markers Fstats")
All markers Fstats
x
Ho 0.2801
Hs 0.2888
Ht 0.2974
Dst 0.0086
Htp 0.3003
Dstp 0.0114
Fst 0.0288
Fstp 0.0381
Fis 0.0304
Dest 0.0161
kable(basicstats_neutral$overall, caption = "Neutral marker Fstats")
Neutral marker Fstats
x
Ho 0.3045
Hs 0.3135
Ht 0.3149
Dst 0.0014
Htp 0.3154
Dstp 0.0018
Fst 0.0043
Fstp 0.0058
Fis 0.0288
Dest 0.0026
kable(basicstats_run_timing$overall, caption = "Run Timing Marker Fstats")
Run Timing Marker Fstats
x
Ho 0.1860
Hs 0.2118
Ht 0.3917
Dst 0.1799
Htp 0.4516
Dstp 0.2399
Fst 0.4593
Fstp 0.5311
Fis 0.1218
Dest 0.3043

Marker level Fst

Now let’s look at the distribution of Fst and check if markers with certain annotations are enriched among outliers.

basicstats$perloc$run_timing <- if_else(rownames(basicstats$perloc) %in% run_timing_loci_names, "run", "non-run")
basicstats$perloc$neutral <- if_else(rownames(basicstats$perloc) %in% neutral_loci_names, "neutral", "adaptive")

ggplot(basicstats$perloc)+geom_histogram(aes(x=Fst, fill=neutral))+theme_classic()

ggplot(basicstats$perloc)+geom_histogram(aes(x=Fst, fill=run_timing))+theme_classic()

#check if any non-run timing markers have high Fst
#max((basicstats$perloc[basicstats$perloc$run_timing=="non-run",])$Fst, na.rm = TRUE)

Yes, there are some clear fst outlier loci in the dataset. These outliers are composed solely of known run-timing markers, but not all run-timing markers demostrate high differentiation. The maximum fst on a non-run-timing marker was 0.0486

Pairwise Population Fst

Let’s collect genetic distance info on pairs of pops as well (Weir and Cochran 1984).

First, the full dataset:

genet.dist(fstat, method="WC84")
##                    fall halfpounder      Summer
## halfpounder 0.001180695                        
## Summer      0.041715265 0.042413548            
## Winter      0.022561337 0.016945631 0.097702619

Then just neutral loci:

genet.dist(fstat_neutral, method="WC84")
##                    fall halfpounder      Summer
## halfpounder 0.000193948                        
## Summer      0.009047995 0.008312525            
## Winter      0.003447739 0.002929805 0.010647062

Next let’s test these Fst results to see if they are significant

To accomplish this we’ll randomize over populations 1000 fold and then conduct a monte-carlo test. Note this approach is very computationally intensive, should find another one if we really want a significance test on Fst, skipped for now.

mat.obs <- pairwise.WCfst(as.data.frame(cbind((genind_2.0$pop),genind_2.0$tab)))

NBPERM <- 100 # this is the number of permutations used for the p-values; for a publication at least 999 would be required.
mat.perm <- lapply(1:NBPERM, function(i) pairwise.WCfst(as.data.frame(cbind(sample(genind_2.0$pop, size = 882),genind_2.0$tab))))


allTests <- list()
 for(i in 1:(nrow(mat.obs)-1)){
   for(j in 2:nrow(mat.obs)){
   allTests[[paste(rownames(mat.obs)[i],rownames(mat.obs)[j],sep="-")]] <- as.randtest(na.omit(sapply(1:NBPERM, function(k) mat.perm[[k]][i,j])), mat.obs[i,j], alter="greater")
   }
}

Fst Results Summary

Moderate differentiation between most pairwise pop comps (not fall and halfpounder), but as suspected, this is driven mostly by the run timing markers. When examining only neutral annotated loci, differentiation is extremely low (less than 1%). Also of note here is that both the neutral and total estimates of differentiation between early and late summer (fall) populations (4.2% and 0.9%, respectively) is quite high. Mean Fst over all loci is about half the level of differentation between early summer and winter populations (9.7%) and nearly the same when examining only neutral loci (1.1%). Indeed, the late summer run fish demonstrate lower differentiation from winter run than early summer run fish at both neutral and all loci.

Population Structure

From the Fst estimation in the section above we have our first ideas about population structure: (a) fall run and half pounder fish demonstrate little to no differentiation, and this group is less differentiated from winter run than summer run fish and (b) some evidence of structure WITHIN halfpounder and fall runs. In this section we will conduct several analyses to uncover population structure in greater detail.

STRUCTURE

Here we use a bayesian, model based clustering algorithm (STRUCTURE) to infer population structure and estimate admixture proportions of individual samples.

First we need to get our dataset ready for structure: remove linked loci, convert to structure format.

# first lets calculate LD (dartR has a great (fast) ld estimator that works right on genind files, so let's use this)
ldreport <- dartR::gl.report.ld(genind_2.0, name = NULL, save = FALSE, nchunks = 2, ncores = 3, chunkname = NULL)
## Start to calculate LD for all pairs of loci...
## Using 3 cores in 2  chunks.
## Depending on the number of loci this may take a while...
## nchunks specifies the number of steps in the progress bar and the number of intermediate saves, but slows the computation a bit. nchunks = 1 is fastest.
## Seperate all 882 loci...
## Generate all possible pairs: 62835 ...
## Calculate LD for all pairs...
## 
  |                                                                            
  |===================================                                   |  50%
  |                                                                            
  |======================================================================| 100%
## # Simulations: 62835 . Took 58 seconds.

We’ll prune (keep one) the dataset of any locus-pairs with r2 > 0.2, then convert to STRUCTURE format

unlinked_genind <- genind_2.0[loc=-unique(ldreport[ldreport$R2>0.2,]$loc2)]
rm(ldreport)
#note just sort of crashed through this with a text editor, not easily logged, but the general idea was transpose the data, split columns (diploid to dual haploid) then convert data to integers
df <- genind2df(unlinked_genind)
df <- as.data.frame(t(df))
write_tsv(df, "genotype_data/all.str.tmp")
df <- read_tsv("genotype_data/all.str.tmp", col_names = FALSE)
df <- t(df)
write_tsv(as.data.frame(df), "genotype_data/all.str", col_names = FALSE)

This removed 25 highly linked SNPs from the dataset.

Run Log

Structure was run in a GUI outside this computation notebook’s environment.
admixture model: admixture, with correlated allele frequency
burnin/mcmc: ran with k=1-3 for 100k iteration to check for convergence of alpha, strong convergence after a few hundred burn-in iterations, used 10,000/20,000 for final runs
replicates: did 10 replicates for k=1-6

Best K was chosen by the evanno method, and estimated in structure harvester.
Replicate results within each K were clumpp’d using the clumpak algorithm on the clumpak webserver

Results

Here we visualize the structure results of the clumpp’d results of all K values

Best K

Best K was 2 according to the Evanno (delta K) method, however, it’s important to remember the bias toward k=2 when differentiation is low or there is no population structure using this method. Delta k literally cannot evaluate K=1. (see note below)

note: delta K suggests best k is two, however, particularly at low differentiation, the delta k method is biased towards k=2 (cullingham 2020), seems like a good place to remind myself that k is a model that doesn’t always fully catpure biological reality and comparing results across different levels of k can proide interesting insights, even when best k is unknown, particularly in the case of low differentiation. Also a note that it might be good to review the literature again on bayesian clustering methods for low levels of differentation, e.g. that gagnaire paper and latch 2006.

delta K plot

delta K plot

Structure Plots

Next let’s take the clumppd results and make some publication-ready figures. First plot is downsampled to 42 samples per “population”, second plot is full dataset.

#import clump results into dataframes
# results are in files k*/majorcluster/clumppfiles/clumpindoutput
# took these files and captured the relevant data with a text editor (original input files are a mess with multiple field separators) and saved to new files
k2 <- read_tsv("./structure/halfpounder/clumpak/formatted_results/k2.txt")
k3 <- read_tsv("./structure/halfpounder/clumpak/formatted_results/k3.txt")
k4 <- read_tsv("./structure/halfpounder/clumpak/formatted_results/k4.txt")
k5 <- read_tsv("./structure/halfpounder/clumpak/formatted_results/k5.txt")


plot_data <- k2 %>% 
  rownames_to_column(var="id") %>% 
  group_by(pop) %>%
  sample_n(40) %>% # sample the half_pounder and fall fish to smaller size for plot
  ungroup() %>%
  gather('cluster', 'prob', clust1:clust2) %>%
  group_by(id) %>% 
  mutate(likely_assignment = cluster[which.max(prob)],
         assingment_prob = max(prob)) %>% 
  arrange(likely_assignment, desc(assingment_prob)) %>% 
  ungroup()

a <- ggplot(plot_data, aes(id, prob, fill = cluster)) +
  geom_col(width=1.0) +
  facet_grid(~pop, scales = 'free', space = 'free', switch = "x") +
  scale_y_continuous(expand = c(0, 0)) +
  scale_x_discrete(expand = expand_scale(add = 1)) +
  theme(panel.spacing=unit(0.1, "lines"), axis.title.x=element_blank(), axis.text.x=element_blank(), axis.ticks.x=element_blank(), legend.position = "none", axis.title.y=element_blank(), strip.background = element_rect(color = "white", fill = "white"), strip.text.x = element_blank()) +
  scale_fill_manual(values = viridisLite::viridis(2))

plot_data <- k3 %>% 
  rownames_to_column(var="id") %>% 
  group_by(pop) %>%
  sample_n(40) %>% # sample the half_pounder and fall fish to smaller size for plot
  ungroup() %>%
  gather('cluster', 'prob', clust1:clust3) %>%
  group_by(id) %>% 
  mutate(likely_assignment = cluster[which.max(prob)],
         assingment_prob = max(prob)) %>% 
  arrange(likely_assignment, desc(assingment_prob)) %>% 
  ungroup()

b <- ggplot(plot_data, aes(id, prob, fill = cluster)) +
  geom_col(width=1.0) +
  facet_grid(~pop, scales = 'free', space = 'free', switch = "x") +
  scale_y_continuous(expand = c(0, 0)) +
  scale_x_discrete(expand = expand_scale(add = 1)) +
  theme(panel.spacing=unit(0.1, "lines"), axis.title.x=element_blank(), axis.text.x=element_blank(), axis.ticks.x=element_blank(), legend.position = "none", axis.title.y=element_blank(), strip.background = element_rect(color = "white", fill = "white"), strip.text.x = element_blank()) +
  scale_fill_manual(values = viridisLite::viridis(3))

plot_data <- k4 %>% 
  rownames_to_column(var="id") %>% 
  group_by(pop) %>%
  sample_n(40) %>% # sample the half_pounder and fall fish to smaller size for plot
  ungroup() %>%
  gather('cluster', 'prob', clust1:clust4) %>%
  group_by(id) %>% 
  mutate(likely_assignment = cluster[which.max(prob)],
         assingment_prob = max(prob)) %>% 
  arrange(likely_assignment, desc(assingment_prob)) %>% 
  ungroup()

c <- ggplot(plot_data, aes(id, prob, fill = cluster)) +
  geom_col(width=1.0) +
  facet_grid(~pop, scales = 'free', space = 'free', switch = "x") +
  scale_y_continuous(expand = c(0, 0)) +
  scale_x_discrete(expand = expand_scale(add = 1)) +
  theme(panel.spacing=unit(0.1, "lines"), axis.title.x=element_blank(), axis.text.x=element_blank(), axis.ticks.x=element_blank(), legend.position = "none", axis.title.y=element_blank(), strip.background = element_rect(color = "white", fill = "white"), strip.text.x = element_blank()) +
  scale_fill_manual(values = viridisLite::viridis(4))

plot_data <- k5 %>% 
  rownames_to_column(var="id") %>% 
  group_by(pop) %>%
  sample_n(40) %>% # sample the half_pounder and fall fish to smaller size for plot
  ungroup() %>%
  gather('cluster', 'prob', clust1:clust5) %>%
  group_by(id) %>% 
  mutate(likely_assignment = cluster[which.max(prob)],
         assingment_prob = max(prob)) %>% 
  arrange(likely_assignment, desc(assingment_prob)) %>% 
  ungroup()

d <- ggplot(plot_data, aes(id, prob, fill = cluster)) +
  geom_col(width=1.0) +
  facet_grid(~pop, scales = 'free', space = 'free', switch = "x", labeller = labeller( pop =  c("fall" ="Late-Summer",  "half"= "Half-Pounder", "summer" =  "Early-Summer", "winter" =  "Winter"))) +
  scale_y_continuous(expand = c(0, 0)) +
  scale_x_discrete(expand = expand_scale(add = 1)) +
  theme(panel.spacing=unit(0.1, "lines"), axis.title.x=element_blank(), axis.text.x=element_blank(), axis.ticks.x=element_blank(), legend.position = "none", axis.title.y=element_blank(), strip.background = element_rect(color = "white", fill = "white")) +
  scale_fill_manual(values = viridisLite::viridis(5))



cowplot::plot_grid(a,b,c,d, ncol=1)

plot_data <- k2 %>% 
  rownames_to_column(var="id") %>% 
 # sample the half_pounder and fall fish to smaller size for plot
  gather('cluster', 'prob', clust1:clust2) %>%
  group_by(id) %>% 
  mutate(likely_assignment = cluster[which.max(prob)],
         assingment_prob = max(prob)) %>% 
  arrange(likely_assignment, desc(assingment_prob)) %>% 
  ungroup()


a <- ggplot(plot_data, aes(id, prob, fill = cluster)) +
  geom_col(width=1.0) +
  facet_grid(~pop, scales = 'free', space = 'free', switch = "x") +
  scale_y_continuous(expand = c(0, 0)) +
  scale_x_discrete(expand = expand_scale(add = 1)) +
  theme(panel.spacing=unit(0.1, "lines"), axis.title.x=element_blank(), axis.text.x=element_blank(), axis.ticks.x=element_blank(), legend.position = "none", axis.title.y=element_blank(), strip.background = element_rect(color = "white", fill = "white"), strip.text.x = element_blank()) +
  scale_fill_manual(values = viridisLite::viridis(2))

plot_data <- k3 %>% 
  rownames_to_column(var="id") %>% 
  gather('cluster', 'prob', clust1:clust3) %>%
  group_by(id) %>% 
  mutate(likely_assignment = cluster[which.max(prob)],
         assingment_prob = max(prob)) %>% 
  arrange(likely_assignment, desc(assingment_prob)) %>% 
  ungroup()

b <- ggplot(plot_data, aes(id, prob, fill = cluster)) +
  geom_col(width=1.0) +
  facet_grid(~pop, scales = 'free', space = 'free', switch = "x") +
  scale_y_continuous(expand = c(0, 0)) +
  scale_x_discrete(expand = expand_scale(add = 1)) +
  theme(panel.spacing=unit(0.1, "lines"), axis.title.x=element_blank(), axis.text.x=element_blank(), axis.ticks.x=element_blank(), legend.position = "none", axis.title.y=element_blank(), strip.background = element_rect(color = "white", fill = "white"), strip.text.x = element_blank()) +
  scale_fill_manual(values = viridisLite::viridis(3))

plot_data <- k4 %>% 
  rownames_to_column(var="id") %>% 
  gather('cluster', 'prob', clust1:clust4) %>%
  group_by(id) %>% 
  mutate(likely_assignment = cluster[which.max(prob)],
         assingment_prob = max(prob)) %>% 
  arrange(likely_assignment, desc(assingment_prob)) %>% 
  ungroup()

c <- ggplot(plot_data, aes(id, prob, fill = cluster)) +
  geom_col(width=1.0) +
  facet_grid(~pop, scales = 'free', space = 'free', switch = "x") +
  scale_y_continuous(expand = c(0, 0)) +
  scale_x_discrete(expand = expand_scale(add = 1)) +
  theme(panel.spacing=unit(0.1, "lines"), axis.title.x=element_blank(), axis.text.x=element_blank(), axis.ticks.x=element_blank(), legend.position = "none", axis.title.y=element_blank(), strip.background = element_rect(color = "white", fill = "white"), strip.text.x = element_blank()) +
  scale_fill_manual(values = viridisLite::viridis(4))

plot_data <- k5 %>% 
  rownames_to_column(var="id") %>% 
  gather('cluster', 'prob', clust1:clust5) %>%
  group_by(id) %>% 
  mutate(likely_assignment = cluster[which.max(prob)],
         assingment_prob = max(prob)) %>% 
  arrange(likely_assignment, desc(assingment_prob)) %>% 
  ungroup()

d <- ggplot(plot_data, aes(id, prob, fill = cluster)) +
  geom_col(width=1.0) +
  facet_grid(~pop, scales = 'free', space = 'free', switch = "x", labeller = labeller( pop =  c("fall" ="Late-Summer",  "half"= "Half-Pounder", "summer" =  "Early-Summer", "winter" =  "Winter"))) +
  scale_y_continuous(expand = c(0, 0)) +
  scale_x_discrete(expand = expand_scale(add = 1)) +
  theme(panel.spacing=unit(0.1, "lines"), axis.title.x=element_blank(), axis.text.x=element_blank(), axis.ticks.x=element_blank(), legend.position = "none", axis.title.y=element_blank(), strip.background = element_rect(color = "white", fill = "white"), strip.text.x = element_text(angle = 90)) +
  scale_fill_manual(values = viridisLite::viridis(5))


cowplot::plot_grid(a,b,c,d, rel_heights = c(1,1,1,1.5) ,ncol=1)

STRUCTURE Results Summary

  1. Regardless of number of putative ancestral genetic clusters (k) modeled, there was a high degree of admixture within individuals, consistent with the observation of high gene flow among a priori assigned groupings.
  2. at K=3 and greater, there is a clear summer associated genetic cluster with highly variable amounts of introgression of this genetic cluster into fall and half pounder assigned fish, and to a lesser extent winter run fish
  3. Similar to the results from k means clustering (see dapc below), as k increases, there is decreasing shared cluster membership among winter and summer run fish, while half pounder and fall run fish continue to split evenly among clusters

PCA

Below we perform an unconstrained ordination of genetic data for both the neutral and the full datasets using PCA (implemented in ade4 and adegenet)

Neutral PCA

first we run pca only on the neutral dataset

neutral_genind <- genind_2.0[loc=neutral_loci_names]

#rename pops for mpublication quality figure
neutral_genind$pop <- as.factor(str_replace(neutral_genind$pop, "Summer", "Early-Summer"))
neutral_genind$pop <- as.factor(str_replace(neutral_genind$pop, "fall", "Late-Summer")) 
neutral_genind$pop <- as.factor(str_replace(neutral_genind$pop, "halfpounder", "Half-Pounder")) 

#reorder pop factor
neutral_genind$pop <- factor(neutral_genind$pop, levels = c("Late-Summer", "Half-Pounder", "Early-Summer", "Winter"))

#replace missing data using mean allele frequency
neutral_genind_scale <- scaleGen(neutral_genind, NA.method="mean")

# run the PCA, dudi.pca uses an interactive prompt to choose pcs to retain, we retain all in order to run some analyses on eigenvalues
pca1 <- dudi.pca(neutral_genind_scale,cent=FALSE,scale=FALSE, scannf = FALSE, nf = 332)
barplot(pca1$eig[1:332],main="PCA eigenvalues")

s.class(pca1$li, pop(neutral_genind),xax=1,yax=2, col=transp(viridisLite::viridis(4),0.7), axesell=FALSE, cstar=0, cpoint=1, grid=FALSE)
add.scatter.eig(pca1$eig[1:10],nf=3,xax=1,yax=2, posi = "bottomright" )

#publication quality figure
# pca_vals <- pca1$li[,c(1:2)]
# pca_vals <- pca_vals %>%
#   rownames_to_column(var = "Sample") %>%
#   left_join(select(genos_2.2, Sample, run)) %>%
#   mutate(run = if_else(run == "fall", "Late-Summer", run)) %>%
#   mutate(run = if_else(run == "halfpounder", "Half-Pounder", run)) %>%
#   mutate(run = if_else(run == "Summer", "Early-Summer", run))
# 
# ggplot(data = pca_vals, aes(Axis1, Axis2, color = run))+geom_point(alpha = 0.5)+stat_ellipse()+theme_classic()+scale_color_viridis_d()+xlab("Principal Component 1")+ylab("Principal Component 2")
# barplot(pca1$eig[1:10],main="PCA eigenvalues")

Let’s also make an interactive 3d plot, since the third PC is not much less informative than the second

plot_ly(x=pca1$li$Axis1, y=pca1$li$Axis2, z=pca1$li$Axis3, type="scatter3d", mode="markers", color=neutral_genind$pop, alpha = 0.8)

I’m interactive!!!

As suggested by our Fst results, there is little differentiation at neutral markers. Also of note is a halfpounder outlier along the first pc. It should not come as a surprise that pca fails to differentiate among populations given the fst. PCA should not be able to differentiate among pops when fst < 1/(sqrt(sample size X number of markers)), which in our case is about 3%, substantially greater than the fst calculated for neutral markers (patterson 2006).

Also note that the primary axis (1 / X) captures variation within fall and half pounedrs that is largely absent from winter and summer runs. Once again, another hint of population structure within halfpounders/fall run fish, but this time suggestive of neutral variation that doesn’t exist within the fall/winter runs.

Full PCA

#replace missing data using mean allele frequency
genind_scale <- scaleGen(genind_2.0, NA.method="mean")

genind_2.1 <- genind_2.0

#rename pops for mpublication quality figure
genind_2.1$pop <- as.factor(str_replace(genind_2.1$pop, "Summer", "Early-Summer"))
genind_2.1$pop <- as.factor(str_replace(genind_2.1$pop, "fall", "Late-Summer")) 
genind_2.1$pop <- as.factor(str_replace(genind_2.1$pop, "halfpounder", "Half-Pounder")) 

#reorder pop factor
genind_2.1$pop <- factor(genind_2.1$pop, levels = c("Late-Summer", "Half-Pounder", "Early-Summer", "Winter"))


# run the PCA, dudi.pca uses an interactive prompt to choose pcs to retain, we retain all in order to run some analyses on eigenvalues
pca1 <- dudi.pca(genind_scale,cent=FALSE,scale=FALSE, scannf = FALSE, nf = 332)
barplot(pca1$eig[1:332],main="PCA eigenvalues")

s.class(pca1$li, pop(genind_2.1),xax=1,yax=2, col=transp(viridisLite::viridis(4),0.7), axesell=FALSE, cstar=0, cpoint=1, grid=FALSE)
add.scatter.eig(pca1$eig[1:10],nf=3,xax=1,yax=2)

Let’s also make an interactive 3d plot, since the third PC is not much less informative than the second

plot_ly(x=pca1$li$Axis1, y=pca1$li$Axis2, z=pca1$li$Axis3, type="scatter3d", mode="markers", color=genind_2.0$pop, alpha = 0.8)

In the first three most important axes of genetic differentiation using the full dataset we observe complete overlap of fall and halfpounder fish, winter and summer fish are clearly separated, but the cloud of fall-halfpounder fish is inclusive of both of these groups. Three clusters of data are present in the data, (1) a winter-like cluster also including some fall run and half pounders, (2) a summer-like cluster also including some fall and half pounder fush and (3) a uniquely half-pounder fall run cluster positioned intermediate between the two.

Also, just like the neutral PCA, this captures some variation unique to fall/halfpounders. This time this variance is pushed back to the third PC, instead of the first.

DAPC

Here we pull a little harder to find some population structure. We conduct a discriminant analysis on PCA transformed genetic variation (DAPC).

k means

First we will use k-means clustering to infer the number of genetic clusters in the dataset without applying a priori assumptions about the number of clusters or population assignments in the sample datset. Then we will check if these groups match well with assigned pops before proceeding to the DAPC.

Below are plots of bayesian information criterion across different numbers of clusters and cluster membership across a range of most-likely clusters:

#k means clustering, keep a lot (330 pcs) (kmeans won't overfit with two many pcs)
#note the number of clusters was chosen interactively, the code below executes the clustering using the best k
clusts <- find.clusters(genind_2.0, n.pca = 330, choose.n.clust = FALSE)

#plot BIC
bic <- as.data.frame(cbind(c(1:length(clusts$Kstat)), clusts$Kstat))
ggplot(bic)+geom_point(aes(x=V1, y=V2))+geom_line(aes(x=V1, y=V2))+theme_classic()+xlim(c(0, 50))+xlab("k")+ylab("BIC")
BIC for k-means clustering across different numbers of genetic clusters

BIC for k-means clustering across different numbers of genetic clusters

clusts <- find.clusters(genind_2.0, n.pca = 330, n.clust = 2)
kable(table(pop(genind_2.0), clusts$grp),caption = "a priori population vs genetic cluster" )
a priori population vs genetic cluster
1 2
fall 58 99
halfpounder 307 336
Summer 0 42
Winter 40 0
clusts <- find.clusters(genind_2.0, n.pca = 330, n.clust = 3)
kable(table(pop(genind_2.0), clusts$grp),caption = "a priori population vs genetic cluster" ) 
a priori population vs genetic cluster
1 2 3
fall 49 42 66
halfpounder 223 251 169
Summer 42 0 0
Winter 0 33 7
clusts <- find.clusters(genind_2.0, n.pca = 330, n.clust = 4)
kable(table(pop(genind_2.0), clusts$grp), caption = "a priori population vs genetic cluster" )
a priori population vs genetic cluster
1 2 3 4
fall 54 6 47 50
halfpounder 139 100 194 210
Summer 14 0 0 28
Winter 0 15 25 0

The model fit is best at k=5, however, only minor improvements in fit are beyond k=4. BIC tends to decrease until best fit only in a perfect island model, in practice, best K is often found at the lowest K that is a substantial improvement from the last. Here we use k=4, but when other k were used, generally the same result: Never any overlap between winter and summer run fish, fall and halfpounders always distributed among the 4 groups.

This result fits with our previous findings, there is structure and high diversity within fall and halfpounders (see heterozygosity section) and substantial overlap between fall-halfpounders and both winter and summer fish, but not between winter and summer fish.

This creates a complex scenario for conducting the DAPC, should we use a priori assigned populations as the basis of our DA? Simulation stuides (eg miller et al 2020) suggests that at low fst, the ability of k means clustering to succesfully recapitulate real genetic clusters is low, however, DA on biologically inaccurate grouping variables is not meaningful… Let’s do both for now as they are both informative in different ways.

DAPC de novo

#first optimize the PCs retained based on the a.score, so run dapc on the full pcs
#invisible(dapc_full_denovo <- dapc(genind_2.0, clusts$grp, n.pca = 330, n.da = 3 ))
#optim.a.score(dapc_full_denovo)
#nice now we use the optimized a score to run our dapc for real
dapc_full_denovo <- dapc(genind_2.0, clusts$grp, n.pca = 57, n.da = 3 )

plot_data <- as.data.frame(cbind(dapc_full_denovo$ind.coord, genind_2.0$pop, clusts$grp))
colnames(plot_data) <- c("LD1", "LD2", "LD3", "pop", "grp")
plot_data$pop <- as.factor(plot_data$pop)
plot_data$grp <- as.factor(plot_data$grp)

ggplot(data=plot_data)+geom_point(aes(x=LD1, y=LD2, color=pop))+scale_color_viridis_d(labels = c("Late-Summer", "Half-Pounder", "Early-Summer", "Winter"))+theme_classic()+ggtitle("DAPC by k-means cluster, color by a priori population")+guides(color=guide_legend("Run-Type")) 

ggplot(data=plot_data)+geom_point(aes(x=LD1, y=LD2, color=grp))+scale_color_viridis_d()+theme_classic()+ggtitle("DAPC by cluster, colored by k-means cluster")+guides(color=guide_legend("K-means cluster")) 

marker_loadings1 <- loadingplot(dapc_full_denovo$var.contr, axis=1,thres=.005, lab.jitter=1, main = "DA axis 1 loading plot")

markers1 <- unique(substr(names(marker_loadings1$var.values),1,nchar(names(marker_loadings1$var.values))-2))

marker_loadings2 <- loadingplot(dapc_full_denovo$var.contr, axis=2,thres=.02, lab.jitter=1, main = "DA axis 2 loading plot")

markers2 <- unique(substr(names(marker_loadings2$var.values),1,nchar(names(marker_loadings2$var.values))-2))

kable(marker_mapping2 %>%
  filter(marker %in% markers1 ) %>%
  select(marker, `Presumed Type`), caption = "markers heavily loaded into first discriminant axis")
markers heavily loaded into first discriminant axis
marker Presumed Type
Chr28_11667578 Adaptive. Run Timing
Omy_RAD38269-10 Neutral
OmyR14589 Adaptive. Residency vs anadromy
Omy_bcAKala-380rd Neutral
OmyR24370 Adaptive. Residency vs anadromy
OmyR40252 Adaptive. Residency vs anadromy
OmyR19198 Adaptive. Residency vs anadromy
OmyR33562 Adaptive. Residency vs anadromy
Omy_SECC22b-88 Neutral. Possible linkage
Omy_GREB1_05 Adaptive. Run timing
Chr28_11625241 Adaptive. Run Timing
Chr28_11658853 Adaptive. Run Timing
Omy_RAD47080-54 Adaptive. Run Timing
Omy_RAD15709-53 Adaptive. Natal site Isothermality. Basin-wide, run-time - related
Chr28_11671116 Adaptive. Run Timing
Chr28_11676622 Adaptive. Run Timing
Chr28_11683204 Adaptive. Run Timing
Chr28_11702210 Adaptive. Run Timing
kable(marker_mapping2 %>%
  filter(marker %in% markers2 ) %>%
  select(marker, `Presumed Type`), caption = "markers heavily loaded into second discriminant axis")
markers heavily loaded into second discriminant axis
marker Presumed Type
OmyR14589 Adaptive. Residency vs anadromy
Omy_bcAKala-380rd Neutral
OmyR24370 Adaptive. Residency vs anadromy
OmyR40252 Adaptive. Residency vs anadromy
OmyR19198 Adaptive. Residency vs anadromy
OmyR33562 Adaptive. Residency vs anadromy

As suspected, given the the fact that fall and halfpounder fish do not fall into a single genetic cluster, de novo assigned genetic cluters do not do a great job of differentiating among a priori assigned run-timing groups. When breaking the fish into four groups these groups are mostly defined by covariation at ~20 markers including run timing markers, neutral markers, and residency markers. Note that the loadings for the run-timing markers are ~2 fold greater than other markers included in the table above and that this cutoff is arbitrary.

DAPC a priori

A note on the best number of pcs
Given that (a) we are using a priori assigned groups to assess differences among said groups and (b) the number of predictors is only slightly less than the number of object (n = 882 and p= ~350), we have to be really cautious about overfitting here, so we’re going to run both cross-fold validation and the “a.score” procedure to determine the correct number of pcs to retain, then we will retain pcs according to which optimization strategy suggests the lower number of pcs. One issue here is that previous results suggest that at least one and potentially two (fall and half pounders) of our a priori groups is a synthetic (i.e. non exclusively genetic) assemblage of individuals. If this is the case, even using approaches such as the a score or cross validation will lead to overfitting w respect to the underlying biology and our choice of n.pc should attempt to limit overfitting by choosing the minimum PCs possible that still strongly discriminate.

The best pca according to optim.a.score was 112, but there was only extremely slight improvement at each pc beyond 12, where we observed a local maximum. Best pca according to cross validation using overall group assignment success was 16, however cross validation using mean group assignment per group was 121. Similar to the a.score approach, there is not a strong “arc” pattern in the latter as one might suspect for a cross validation procedure, instead fit improves rapidly after the first few pcs, declines, then and continues a gradual improvement. We chose the number of PCs that correspond to this first peak in both the xval and optim.a.score procedure: 12. However, it should be noted that running DAPC with widely varying PCs did not qualitatively change the pattern of results. The degree of discrimination increased with increasing PCs and loadings on individual markers varied, but not the qualitative relationships among a priori assigned populations.

Now the DAPC

#first optimize the PCs retained based on the a.score, so run dapc on the full pcs
#invisible(dapc_full_apriori <- dapc(genind_2.0, n.pca = 330, n.da = 3 ))
#to speed this up in future knits, we ran optim.a.score once, but not in the notebook
#optim.a.score(dapc_full_apriori, smart=FALSE)
#nice now we use the optimized a score to run our dapc for real
#mat <- as.matrix(scaleGen(genind_2.0, NA.method="mean", scale=FALSE, center=FALSE))
#xpop <- pop(genind_2.0)
#xval <- xvalDapc(mat, xpop, n.pca.max = 200, training.set = 0.9, result = "overall", center = TRUE, scale = FALSE, n.pca = seq(1,300, length.out = 60), n.rep = 50, xval.plot = TRUE)

invisible(dapc_full_apriori <- dapc(genind_2.0, n.pca = 12, n.da = 3 ))


plot_data <- as.data.frame(dapc_full_apriori$ind.coord)
plot_data$pop <- as.character(genind_2.0$pop)


ggplot(data=plot_data)+geom_point(aes(x=LD1, y=LD2, color = pop), alpha = 0.5, size = 2)+theme_classic()+scale_color_viridis_d()+stat_ellipse(aes(x=LD1, y=LD2, color = pop))

scatter(dapc_full_apriori)

marker_loadings1 <- loadingplot(dapc_full_apriori$var.contr, axis=1,thres=.01, lab.jitter=1, main = "laoding plot for DA 1")

markers1 <- unique(substr(names(marker_loadings1$var.values),1,nchar(names(marker_loadings1$var.values))-2))

marker_loadings2 <- loadingplot(dapc_full_apriori$var.contr, axis=2,thres=.00713, lab.jitter=1, main = "laoding plot for DA 2")

markers2 <- unique(substr(names(marker_loadings2$var.values),1,nchar(names(marker_loadings2$var.values))-2))

kable(marker_mapping2 %>%
  filter(marker %in% markers1 ) %>%
  select(marker, `Presumed Type`), caption = "Markers that load strongly onto first DA")
Markers that load strongly onto first DA
marker Presumed Type
Chr28_11667578 Adaptive. Run Timing
Omy_128996-481 Neutral
Omy_GREB1_05 Adaptive. Run timing
Chr28_11625241 Adaptive. Run Timing
Chr28_11658853 Adaptive. Run Timing
Omy_RAD47080-54 Adaptive. Run Timing
Omy_RAD15709-53 Adaptive. Natal site Isothermality. Basin-wide, run-time - related
Chr28_11671116 Adaptive. Run Timing
Chr28_11676622 Adaptive. Run Timing
Chr28_11683204 Adaptive. Run Timing
Chr28_11702210 Adaptive. Run Timing
kable(marker_mapping2 %>%
  filter(marker %in% markers2 ) %>%
  select(marker, `Presumed Type`, chr, start), caption = "Markers that load strongly onto the second DA")
Markers that load strongly onto the second DA
marker Presumed Type chr start
Omy_112301-202 Neutral 3 37590514
Omy_103705-558 Neutral 17 7065921
Omy_RAD42465-32 Adaptive. Maxiumum air temperature (warmest quarter). Basin-wide, top-outlier 18 25034263
Omy_RAD50632-21 Adaptive. Basin-wide, top-outlier 1 38986481
Omy_G3PD_2-371 Neutral 2 6083880
OmyR24370 Adaptive. Residency vs anadromy 5 28579317
OmyR19198 Adaptive. Residency vs anadromy 5 34973454
OmyR33562 Adaptive. Residency vs anadromy 5 47337494
Omy_SECC22b-88 Neutral. Possible linkage 5 61828845
Omy_vamp5-303 Neutral 6 33625089
Omy_RAD76570-62 Adaptive. Minimum annual precipitation. Basin-wide, precipitation-related; 12 56297712
Omy_101993-189 Neutral 17 21491237
Omy_RAD43612-42 Neutral 18 29118747
Omy_LDHB-2_i6 Neutral 21 24130489
# for ms figures

# ggplot(data=plot_data)+geom_point(aes(x=LD1, y=LD2, color = pop), alpha = 0.5, size = 2)+theme_classic()+scale_color_viridis_d(name = "Run",labels = c("Late-Summer", "Half-Pounder", "Early-Summer", "Winter" ))+stat_ellipse(aes(x=LD1, y=LD2, color = pop))

#for ms tables with loading value
#marker_loadings2a <- loadingplot(dapc_full_apriori$var.contr, axis=2,thres=.00000001, lab.jitter=1, main = "laoding plot for DA 2")
#write.table(as.data.frame(marker_loadings2a$var.values), "load.txt", quote = FALSE, sep = "\t" )
#write.table(select(marker_mapping2, marker, `Presumed Type`, chr), "map.txt", quote=FALSE, sep = "\t")  

DAPC largely discriminates among groups along the first discriminant axis. Fall and half pounder fish demonstrate little to no differentiation along any DA. The first DA strongly separates winter and summer run fish, and is driven by by 11 markers including run timing associated markers and a neutral marker (Omy_128996-481). The second da captures much less variation among groups (78% vs 17%), and is driven largely by 5 markers neutral markers and to a lesser extent 19 additional markers including additional neutral markers and several associated with residency/anadromy and other adapative traits. Interstingly one of the “neutral” annotated markers (Omy_LDHB-2…) have been associated with residency vs anadromy in an association study in the klickitat river (doi.org/10.1080/00028487.2011.588131).

DAPC Classification

Next we use this a priori DAPC to attempt to classify half pounders into early-summer-like, winter-like or intermediate groups. Note: did this two ways, in the first we run a true classfiication scheme: exclude the fall and half pounder fish, run create a classifier then predict the fall and half pounder membership based on this classifier, however if fall run fish are a real group, the classifier should include all three or four groups and therefore would default back to the original DAPC. I left the true classifier (just winter vs summer) in the notebook, but I think we should just present the full a priori DAPC results. In any case, the effect on the end result is marginal

classifier <- dapc(genind_2.0[pop = c("Winter", "Summer")], n.da = 2, n.pca = 1) #optim a score is 1
pred.half<- predict.dapc(classifier, newdata=genind_2.0[pop=c("halfpounder" , "fall")])

preds <- as.data.frame(cbind(pred.half$posterior, pred.half$ind.scores))

#hmm pop info is not moving over easilt left just merge it from another df
colnames(fstat)[1] <- "pop"
pops_info <- fstat %>%
  rownames_to_column(var="sample") %>%
  select(sample, pop)
preds <- preds %>%
  rownames_to_column(var="sample") %>%
  left_join(pops_info)

ld1 <- as.data.frame(classifier$ind.coord) %>%
   bind_rows(as.data.frame(pred.half$ind.scores)) %>%
  rownames_to_column(var="sample") %>%
  left_join(pops_info)

ld1_2 <- as.data.frame(dapc_full_apriori$ind.coord) %>%
  rownames_to_column(var="sample") %>%
  left_join(pops_info)
    
#plot
ggplot(data=ld1)+geom_density(aes(LD1, fill=pop, color=pop), alpha = 0.2)+scale_color_viridis_d()+scale_fill_viridis_d()+theme_classic()+ggtitle("True Classifier")

ggplot(data=ld1_2)+geom_density(aes(LD1, fill=pop, color=pop), alpha = 0.2)+scale_color_viridis_d(labels = c("Late-Summer", "Half-Pounder", "Early-Summer", "Winter"))+scale_fill_viridis_d(labels = c("Late-Summer", "Half-Pounder", "Early-Summer", "Winter"))+theme_classic()+ggtitle("A priori DAPC results")+labs(color = "Run", fill = "Run")

# assignments using classifier
CIs <- ld1 %>%
  group_by(pop) %>%
  summarise(loCI = quantile(LD1, probs = 0.025),
            hiCI = quantile(LD1, probs = 0.975))

#number of half pounders that fall in the 95% credible interval for winter fish assignment
kable(ld1 %>%
  filter(pop == "halfpounder") %>%
  summarise(winter_assigned = sum((LD1 < CIs$hiCI[4])), summer_assigned = sum(( LD1 > CIs$loCI[3])), unassigned = sum((LD1 <= CIs$loCI[3] & LD1 >= CIs$hiCI[4])) ), caption = "True Classifier - Half Pounders")
True Classifier - Half Pounders
winter_assigned summer_assigned unassigned
143 62 438
# assignments using original DAPC
CIs <- ld1_2 %>%
  group_by(pop) %>%
  summarise(loCI = quantile(LD1, probs = 0.025),
            hiCI = quantile(LD1, probs = 0.975))

#number of half pounders that fall in the 95% credible interval for winter fish assignment
kable(ld1_2 %>%
  filter(pop == "halfpounder") %>%
  summarise(winter_assigned = sum((LD1 < CIs$hiCI[4])), summer_assigned = sum(( LD1 > CIs$loCI[3])), unassigned = sum((LD1 <= CIs$loCI[3] & LD1 >= CIs$hiCI[4])) ), caption = "A priori DAPC - Half Pounders")
A priori DAPC - Half Pounders
winter_assigned summer_assigned unassigned
348 77 218
#same as above but for fall fish

CIs <- ld1 %>%
  group_by(pop) %>%
  summarise(loCI = quantile(LD1, probs = 0.025),
            hiCI = quantile(LD1, probs = 0.975))

#number of half pounders that fall in the 95% credible interval for winter fish assignment
kable(ld1 %>%
  filter(pop == "fall") %>%
  summarise(winter_assigned = sum((LD1 < CIs$hiCI[4])), summer_assigned = sum((LD1 > CIs$loCI[3])), unassigned = sum((LD1 <= CIs$loCI[3] & LD1 >= CIs$hiCI[4])) ), caption = "True Classifier - Fall Run")
True Classifier - Fall Run
winter_assigned summer_assigned unassigned
16 6 135
# assignments using original DAPC
CIs <- ld1_2 %>%
  group_by(pop) %>%
  summarise(loCI = quantile(LD1, probs = 0.025),
            hiCI = quantile(LD1, probs = 0.975))

#number of half pounders that fall in the 95% credible interval for winter fish assignment
kable(ld1_2 %>%
  filter(pop == "fall") %>%
  summarise(winter_assigned = sum((LD1 < CIs$hiCI[4])), summer_assigned = sum((LD1 > CIs$loCI[3])), unassigned = sum((LD1 <= CIs$loCI[3] & LD1 >= CIs$hiCI[4])) ), caption = "A priori DAPC - Fall Run")
A priori DAPC - Fall Run
winter_assigned summer_assigned unassigned
78 8 71

When we use the original DAPC (not the classifier, still need to make a decision here) we attempt to classify fish into winter or summer like using the first DA which accounts for ~80% of the among group variance and is driven almost exclusively by known run timing associated markers. We then draw 95% credible intervals for winter and summer run and count how many fall and half pounder fish fall beyond the lower bounds for these intervals (i.e. at least as winter-like or summer-like as the 95% CI). For fall fish, the majority are winter assigned, nearly as many are unassigned and a small portion are summer assignmed. The same pattern is observed among the half pounders, but ratio number of unassigned fish is somewhat lower.

Neutral DAPC

There is evidence of weak differentiation among populations at neutral markers from the estimation of Fst. Here we will attempt to magnify these differences using a discirminat analaysis among a priori assigned run types.

#neutraldapc <- dapc(neutral_genind, n.pca = 226, n.da= 3 )
#optim.a.score(neutraldapc)
#nice now we use the optimized a score to run our dapc for real
invisible(neutraldapc <- dapc(neutral_genind, n.pca = 60, n.da = 3 ))


plot_data <- as.data.frame(neutraldapc$ind.coord)
plot_data$pop <- as.character(genind_2.0$pop)

#scatter.dapc(neutraldapc)

plot_ly(x=plot_data$LD1, y=plot_data$LD2, z=plot_data$LD3, type="scatter3d", mode="markers", color=genind_2.0$pop, alpha = 0.8)
#pub quality figure
ggplot(data=plot_data)+geom_point(aes(x=LD1, y=LD2, color = pop), alpha = 0.5, size = 2)+theme_classic()+scale_color_viridis_d(name = "Run",labels = c("Late-Summer", "Half-Pounder", "Early-Summer", "Winter" ))+stat_ellipse(aes(x=LD1, y=LD2, color = pop))

DA axis 1 separates summer from half pounder, fall and winter. DA 2 separates summer from the rest. DA3 also separates summer from the rest. Interestingly this is the same pattern as obsevred using the full dataset, despite the fact that the first DA was driven almost exclusively by run timing markers. This suggests either LD in our dataset between run timing and neutral markers, or that restricted gene flow tied to run timing leads differentiation at neutral markers. In any case it follows with the neutral Fst a that half pounders and fall run fish are more similar to winter run than summer run across the genome.

DAPC Results Summary

K means clustering always assigned half pounders and fall run fish to all genetic clusters, while always segregating winter and summer run fish. This suggests there is substantial structure with fall and half pounder fish. Similarly, when we attempt to discriminate among a priori assigned groups, the genetic variation contained within adults that make a fall run, as well as half-pounders is nearly entirely inclusive of winter run fish, and partially inclusive of summer run fish, however, winter and summer run fish are well discriminated.

In other words:
(a) winter and summer run fish can clearly be distinguished from one another using the genetic variation captured by our GTseq panel. Genetic variation among these two groups is driven by markers with known run-timing associations (b) fall run and half pounder fish can not be discriminated from each other using the genetic information we half available
(c) fall run and half pounder fish demonstrate substantial within-“population” structure, and include fish that demonstrate a high degree of genetic similarity with both winter run and summer run fish, hwoever, most individuals demonstrate intermediate genetic charactistics. interestingly,this within population structure is driven neutral markers and residency associated markers

Run Timing Markers

Let’s look specifically at run-timing associated markers

marker_mapping <- readxl::read_xlsx("./metadata/final_mapping_results.xlsx", sheet = 1)
run_timing_loci_names <- marker_mapping %>%
  filter(chr == "28" | CRITFC_chromosome == "28") %>%
  filter(str_detect(`Presumed Type`, 'run|Run')) %>%
  select(marker)

#different naming convention, lets fix
run_timing_loci_names <- str_replace(run_timing_loci_names$marker, "Omy28", "Chr28")

#run_timing_loci <- genos_2.2 %>%
#  select(Sample, Date, run, year, one_of(run_timing_loci_names))

#genind_pop <- seppop(genind_2.0)
#run_timing_fall <- genind_pop$fall[loc=run_timing_loci_names]
#run_timing_half <- genind_pop$halfpounder[loc=run_timing_loci_names]
#run_timing_winter <- genind_pop$Winter[loc=run_timing_loci_names]
#run_timing_summer <- genind_pop$Summer[loc=run_timing_loci_names]

Diagnostic Markers

Are any markers diagnostic (fixed or nearly fixed within a run-timing group)? Below we plot:
(a) a full heatmap of allele frequencies
(b) heatmap just at near diagnostic markers (0.9 vs 0.1 across winter and summer runs)
(c) a heatmap of all markers with run-timing associations

#because adegenet can be so difficult to use, let take advantage of popgenreport to collect our summary data
all_counts <- allele.dist(genind_2.0, mk.figures = FALSE)$count
#make into a dataframe
all_counts <- as.data.frame(do.call(rbind, all_counts))
colnames(all_counts) <- c("fall_count", "half_count", "summer_count", "winter_count")
all_counts$sum <- rowSums(all_counts, na.rm = TRUE)

all_freqs <- allele.dist(genind_2.0, mk.figures = FALSE)$frequency
#make into a dataframe
all_freqs <- as.data.frame(do.call(rbind, all_freqs))

all_freqs <- as.data.frame(cbind(all_freqs, all_counts))

##### get only minor allele
all_freqs$marker <- genind_2.0$loc.fac

#now group by marker and keep the minor allele, then convert counts to 
all_maf <- all_freqs %>%
  group_by(marker) %>%
  slice_min(sum) %>%
  replace(., is.na(.), 0)

#filter all maf to include only diagnostic or near diagnostic (0.95) markers between winter and summer runs
diagmaf <- all_maf %>%
  filter((Summer > 0.9 & Winter < 0.1) | (Summer < 0.1 & Winter > 0.9))

# plot big heatmap
tmat <- t(as.matrix(all_maf[,1:4]))
colnames(tmat) <- all_maf$marker
pheatmap(tmat, show_colnames  = F)

#plot as a heatmap
col_pal<- colorRampPalette(c("red", "white", "blue"))(256)
tmat <- t(as.matrix(diagmaf[,1:4]))
colnames(tmat) <- diagmaf$marker
pheatmap(tmat, cluster_cols = FALSE)

#because adegenet can be so difficult to use, let take advantage of popgenreport to collect our summary data
run_timing_counts <- allele.dist(genind_2.0[loc=run_timing_loci_names], mk.figures = FALSE)$count
#make into a dataframe
run_timing_counts <- as.data.frame(do.call(rbind, run_timing_counts))
colnames(run_timing_counts) <- c("fall_count", "half_count", "summer_count", "winter_count")
run_timing_counts$sum <- rowSums(run_timing_counts, na.rm = TRUE)

run_timing_freqs <- allele.dist(genind_2.0[loc=run_timing_loci_names], mk.figures = FALSE)$frequency
#make into a dataframe
run_timing_freqs <- as.data.frame(do.call(rbind, run_timing_freqs))

run_timing_freqs <- as.data.frame(cbind(run_timing_freqs, run_timing_counts))

##### get only minor allele
#then put the marker names in the same order as the allele.dist (exlude the missing marker first)
run_timing_loci_names <- run_timing_loci_names[run_timing_loci_names != "Omy_RAD52458-17"]
run_timing_freqs$marker <- rep(run_timing_loci_names[order(run_timing_loci_names)], each = 2)
#now group by marker and keep the minor allele, then convert counts to 
run_timing_maf <- run_timing_freqs %>%
  group_by(marker) %>%
  slice_min(sum) %>%
  replace(., is.na(.), 0)

#plot as a heatmap
col_pal<- colorRampPalette(c("red", "white", "blue"))(256)
tmat <- t(as.matrix(run_timing_maf[,1:4]))
colnames(tmat) <- run_timing_maf$marker
pheatmap(tmat, cluster_cols = FALSE)

#repolarize greb05 and 7080-54
repol <- run_timing_maf %>%
  mutate(fall = ifelse(marker == "Omy_GREB1_05", (1 - fall), fall)) %>%
  mutate(halfpounder = ifelse(marker == "Omy_GREB1_05", (1 - halfpounder), halfpounder)) %>%
  mutate(Summer = ifelse(marker == "Omy_GREB1_05", (1 - Summer), Summer)) %>%
  mutate(Winter = ifelse(marker == "Omy_GREB1_05", (1 - Winter), Winter)) %>%
  mutate(fall = ifelse(marker == "Omy_RAD47080-54", (1 - fall), fall)) %>%
  mutate(halfpounder = ifelse(marker == "Omy_RAD47080-54", (1 - halfpounder), halfpounder)) %>%
  mutate(Summer = ifelse(marker == "Omy_RAD47080-54", (1 - Summer), Summer)) %>%
  mutate(Winter = ifelse(marker == "Omy_RAD47080-54", (1 - Winter), Winter))

repol <- repol %>%
  select(fall, halfpounder, Winter, Summer, marker) %>%
  rename("Late-Summer" = fall, "Half-Pounder" = halfpounder, "Early-Summer" = Summer)

col_pal<- colorRampPalette(c("red", "white", "blue"))(256)
tmat <- t(as.matrix(repol[,1:4]))
colnames(tmat) <- repol$marker
pheatmap(tmat)

Of the 14 markers with known run-timing associations, 9 are fixed or nearly fixed for alternative alleles in winter and summer run fish. One additional marker (greb1_05) is highly informative, but is not fixed in winter run fish. Fall run fish and half pounder demonstrate intermediate genotypes and very little differentiation from one another.

More heatmaps

Sometimes it can be helpful to look at the pattern across individuals and along the genome rather than allele frequency. Below we make a graphical representation of some of the genotypes at run timing associated markers as well.

#first we need to get a dataset that will work

#the tab slot of the adegenet object will work if we polarize the data by selecting the allele with the higher count in either winter or early summer

#for each pair of columns, keep the one with the highest average, few enough markers, we can just choose these manually
colSums(chr28_seppop$Winter$tab, na.rm = TRUE)
##  Chr28_11607954.G  Chr28_11607954.A  Chr28_11625241.A  Chr28_11625241.G 
##                62                18                 0                80 
##  Chr28_11632591.G  Chr28_11632591.A  Chr28_11658853.A  Chr28_11658853.C 
##                63                17                 0                80 
##  Chr28_11667578.T  Chr28_11667578.C  Chr28_11671116.C  Chr28_11671116.T 
##                 0                80                 0                78 
##  Chr28_11676622.T  Chr28_11676622.G  Chr28_11683204.G  Chr28_11683204.T 
##                 0                80                 0                80 
##  Chr28_11702210.T  Chr28_11702210.G  Chr28_11773194.A  Chr28_11773194.T 
##                 0                80                72                 8 
##    Omy_GREB1_05.T    Omy_GREB1_05.G    Omy_GREB1_09.G    Omy_GREB1_09.T 
##                14                66                80                 0 
## Omy_RAD15709-53.G Omy_RAD15709-53.A Omy_RAD47080-54.A Omy_RAD47080-54.G 
##                80                 0                 0                76
#bind these together
polarized_allele_counts <- as.data.frame(chr28_seppop$Winter$tab[,c(1,4,5,8,10,12,14,16,18,19,22,23,25,28)]) %>%
  bind_rows(as.data.frame(chr28_seppop$Summer$tab[,c(1,4,5,8,10,12,14,16,18,19,22,23,25,28)])) %>%
  bind_rows(as.data.frame(chr28_seppop$fall$tab[,c(1,4,5,8,10,12,14,16,18,19,22,23,25,28)])) %>%
  bind_rows(as.data.frame(chr28_seppop$halfpounder$tab[,c(1,4,5,8,10,12,14,16,18,19,22,23,25,28)]))


#plot heatmap

#first, order the markers according to genomic position
map_results <- readxl::read_xlsx("metadata/final_mapping_results.xlsx", sheet = 1)
map_results <- map_results %>%
  mutate(marker = str_replace(marker, "Omy28", "Chr28"))

colnames(polarized_allele_counts) <- substr(colnames(polarized_allele_counts), 0, nchar(colnames(polarized_allele_counts))-2)

marker_pos <- map_results %>%
  select(marker, CRITFC_SNP_pos_genome) %>%
  filter(marker %in% colnames(polarized_allele_counts))

#hmm one marker position is missing, just add it manually
marker_pos$CRITFC_SNP_pos_genome <- as.numeric(marker_pos$CRITFC_SNP_pos_genome)
## Warning: NAs introduced by coercion
marker_pos[14,2] <- 11702210

#order to columns
polarized_allele_counts <- polarized_allele_counts %>%
  select(unlist(c(marker_pos[order(marker_pos$CRITFC_SNP_pos_genome),1]), use.names = FALSE))
#add pop info
polarized_allele_counts$pop <- c(rep("winter", nrow(chr28_seppop$Winter$tab)), rep("summer", nrow(chr28_seppop$Summer$tab)), rep("fall", nrow(chr28_seppop$fall$tab)), rep("halfpounder", nrow(chr28_seppop$halfpounder$tab)))

#split into groups for easy visualization
winter_pac <- filter(polarized_allele_counts, pop == "winter")
summer_pac <- filter(polarized_allele_counts, pop == "summer")
fall_pac <- filter(polarized_allele_counts, pop == "fall")
halfpounder_pac <- filter(polarized_allele_counts, pop == "halfpounder")

a = pheatmap(winter_pac[,-15], cluster_cols = FALSE, show_rownames = FALSE, main = "winter")

b = pheatmap(summer_pac[,-15], cluster_cols = FALSE, show_rownames = FALSE, main = "summer")

c = pheatmap(fall_pac[,-15], cluster_cols = FALSE, show_rownames = FALSE, main = "fall")

d = pheatmap(halfpounder_pac[,-15], cluster_cols = FALSE, show_rownames = FALSE, main = "halfpounder")

Omy05 Markers

The second DA in the a priori DAPC captures less among group variation than the first, however, it is oriented along a large axis of genetic variation within both the fall run and halfpounders. Part of this axis is driven by snps with known association with residency vs anadromy. Let’s explore this briefly:

First let’s take a look at the allele frequencies at residency vs anadromy alleles on chromosome 5

residency_loci_names <- marker_mapping %>%
  filter(chr == "5" | CRITFC_chromosome == "5") %>%
  filter(str_detect(`Presumed Type`, 'residency|Residency')) %>%
  select(marker)

residency_loci_names <- residency_loci_names$marker

#one of these is not in e final dataset, remove
residency_loci_names <- residency_loci_names[-6]

# get summary info
residency_counts <- allele.dist(genind_2.0[loc=residency_loci_names], mk.figures = FALSE)$count
#make into a dataframe
residency_counts <- as.data.frame(do.call(rbind, residency_counts))
colnames(residency_counts) <- c("fall_count", "half_count", "summer_count", "winter_count")
residency_counts$sum <- rowSums(residency_counts, na.rm = TRUE)

residency_freqs <- allele.dist(genind_2.0[loc=residency_loci_names], mk.figures = FALSE)$frequency
#make into a dataframe
residency_freqs <- as.data.frame(do.call(rbind, residency_freqs))

residency_freqs <- as.data.frame(cbind(residency_freqs, residency_counts))

##### get only minor allele
#then put the marker names in the same order as the allele.dist (exlude the missing marker first)
residency_freqs$marker <- rep(residency_loci_names[order(residency_loci_names)], each = 2)
#now group by marker and keep the minor allele, then convert counts to 
residency_maf <- residency_freqs %>%
  group_by(marker) %>%
  slice_min(sum) %>%
  replace(., is.na(.), 0)

#plot as a heatmap
residency_maf <- residency_maf %>%
  rename("Half-pounder" = "halfpounder", "Early-Summer" = "Summer", "Late-Summer" = "fall")
col_pal<- colorRampPalette(c("red", "white", "blue"))(256)
tmat <- t(as.matrix(residency_maf[,1:4]))
colnames(tmat) <- residency_maf$marker
pheatmap(tmat)

Fall and winter fish vary (somewhat: p = 0.3 vs 0.6) at these residency alleles, with summer and halfpounder intermediate.

Let’s look at allele frequency variance at these alleles as well:

res_divs <- filter(marker_divs, marker %in% residency_loci_names)
res_divs <- pivot_wider(res_divs, names_from = obs_exp, values_from = H)

ggplot(data=res_divs)+geom_point(aes(He, Ho, color = run), size = 3, alpha = 0.8)+scale_color_viridis_d()+theme_classic()+geom_abline(aes(slope =1, intercept = 0))+xlim(0,0.5)+ylim(0,0.5)

So variation is high within all the runs (least so at fall run), with expected heterozygosity all the way near 0.5. Interestingly there is a dearth of heterozygotes in all but summer run, however, this is significant only for halfpounder where it is the strongest. This pattern suggests that there is some structuring with the halfpounders and to a lesser extent fall run (but not summer and winter run fish), with reduced interbreeding between individuals expressing alternative residency alleles. interpreting the differences across populations suggests there’s varying degrees of introgression between anadromous and resident fish across the run types. this may be explained by the temporal and spatial variation in spawning between run types.

also something to think about here is that there is an observed tendency of hatchery released smolts to residualize over the summer, winter fish are graded (i.e. precocially mature males are not released) so it is likely to be the summer hatchery fish contributing the residuals.

also of note here is that our winter sample includes both hatchery and natural origin fihs, while our other samples contain only natural origin, so the pattern here could be due to hatchery alleles.

LD

Quick LD plot using r(bar)D of winter and summer runs (first plot) vs fall and half pounder (second) plot. Will come back and clean this up later. Absence of variation within some runs is causing code to break for plotting and estimating LD in a lot of packages. If we want formal LD, I think I’ll need to manually convert to a plink or vcf format and estimate D or r2 with plink/tassel or other .

chr28_genind <- genind_2.0[loc = run_timing_loci_names]

# first lets calculate LD (dartR has a great (fast) ld estimator that works right on genind files, so let's use this)
#ldreport_winter <- dartR::gl.report.ld(chr28_genind[pop = c("Summer", "Winter")], name = NULL, save = FALSE )
#ldreport_fall <- dartR::gl.report.ld(chr28_genind[pop = c("fall", "halfpounder")], name = NULL, save = FALSE )

#this command is breaking when running on the subset datasets (no variance to do the calcs on), lets settle on poppr's tools for now (eventually will need a better LD estimate)

invisible(ldr <- poppr::pair.ia(chr28_genind[pop = c("Summer", "Winter")], limits = c(-0.1, 1.1)))

invisible(ldr_f <- poppr::pair.ia(chr28_genind[pop = c("fall", "halfpounder")], limits = c(-0.1, 1.1)))

# reorder winter / summer
ldr_f <- rownames_to_column(as.data.frame(ldr_f), var = "rowid")
ldr_f <- ldr_f %>%
  separate(rowid, into = c("snp1", "snp2"), sep = ":")

ldr_f$snp1 <- factor(ldr_f$snp1, levels = marker_pos$marker[order(marker_pos$CRITFC_SNP_pos_genome)], ordered = TRUE)
ldr_f$snp2 <- factor(ldr_f$snp2, levels = marker_pos$marker[order(marker_pos$CRITFC_SNP_pos_genome)], ordered = TRUE)

ldr_f_opposite_tri <- ldr_f
colnames(ldr_f_opposite_tri) <- c("snp2", "snp1", "rbarD")
ldr_f <- ldr_f %>%
  bind_rows(ldr_f_opposite_tri)

ggplot(ldr_f)+geom_tile(aes(snp1, snp2, fill=rbarD))+theme_classic()+theme(axis.text.x = element_text(angle = 90))+scale_fill_gradient(low = "#bdc3c7", high= "#2c3e50")

# reorder fall_halfpounder
ldr <- rownames_to_column(as.data.frame(ldr), var = "rowid")
ldr <- ldr %>%
  separate(rowid, into = c("snp1", "snp2"), sep = ":")

ldr$snp1 <- factor(ldr$snp1, levels = marker_pos$marker[order(marker_pos$CRITFC_SNP_pos_genome)], ordered = TRUE)
ldr$snp2 <- factor(ldr$snp2, levels = marker_pos$marker[order(marker_pos$CRITFC_SNP_pos_genome)], ordered = TRUE)

ldr_opposite_tri <- ldr
colnames(ldr_opposite_tri) <- c("snp2", "snp1", "rbarD")
ldr <- ldr %>%
  bind_rows(ldr_opposite_tri)

ggplot(ldr)+geom_tile(aes(snp1, snp2, fill=rbarD))+theme_classic()+theme(axis.text.x = element_text(angle = 90))+scale_fill_gradient(low = "#bdc3c7", high= "#2c3e50")

now let’s do the same but skip the uninformative markers for better visualization

uninf_markers <- c("Chr28_11607954","Omy_GREB1_09", "Chr28_11632591", "Chr28_11773194")

ldr <- ldr %>%
  filter(!(snp1 %in%  uninf_markers)) %>%
  filter(!(snp2 %in%  uninf_markers))


ggplot(ldr)+geom_tile(aes(snp1, snp2, fill=rbarD))+theme_classic()+theme(axis.text.x = element_text(angle = 90))+scale_fill_viridis_c( limits = c(0,1))

ldr_f <- ldr_f %>%
  filter(!(snp1 %in%  uninf_markers)) %>%
  filter(!(snp2 %in%  uninf_markers))

ggplot(ldr_f)+geom_tile(aes(snp1, snp2, fill=rbarD))+theme_classic()+theme(axis.text.x = element_text(angle = 90))+scale_fill_viridis_c(limits = c(0,1))

Yes, LD is weaker among fall and half pounder than winter/summer.

LD is also about equally strong over the entire length of the region, there are no clear ld blocks within the region.

haplotypes

The identification of “discordant” genotypes at markers in the region on chromosome 28 associated with run timing (i.e. winter and summer run fish demonstrate near perfect linkage among markers in this region, but fall and halfpounders demonstrate reduced LD), suggests that recombination can occur within this region. If we can phase our data we should be able to identify chr28 haplotypes to gain more inference into the evolutionary origin/maintenance of fall/halfpounders.

Outline:
- identify snps in linkage around chr28 and chr05
- phase genotypes for these snps
- cluster phased genos to identify haplogroups and/or build haplotype network to visualize relationships and visually identify clusters

phasing

First we need to get phased genotypic data in order to identify haplotypes.

We will use fastPHASE. First lets get data in the fastphase format

# the fast phase format's genotype data for an individual is expressed on two rows with each row representing an unphased haploid genotype. There is also some metadata added into header rows. the main gt matrix  is very similar to the structure format so we'll use the same approach to reformat the data as we used to convert from genind to structure

chr28_genind <- genind_2.0[loc=run_timing_loci_names]

#note just sort of crashed through this with a text editor, not easily logged, but the general idea was transpose the data, split columns (diploid to dual haploid by adding a tab between the values), then convert N to ?, then read the result into R and transpose again

df <- genind2df(chr28_genind)

#reorder to chromosomal order here
#first, order the markers according to genomic position
map_results <- readxl::read_xlsx("metadata/final_mapping_results.xlsx", sheet = 1)
map_results <- map_results %>%
  mutate(marker = str_replace(marker, "Omy28", "Chr28"))

marker_pos <- map_results %>%
  select(marker, CRITFC_SNP_pos_genome) %>%
  filter(marker %in% colnames(df))

#hmm one marker position is missing, just add it manually
marker_pos$CRITFC_SNP_pos_genome <- as.numeric(marker_pos$CRITFC_SNP_pos_genome)
## Warning: NAs introduced by coercion
marker_pos[14,2] <- 11702210

#order to columns
df <- df %>%
  select(unlist(c(marker_pos[order(marker_pos$CRITFC_SNP_pos_genome),1]), use.names = FALSE))

# now transponse and save
df <- as.data.frame(t(df))
write_tsv(df, "genotype_data/chr28.tmp")

#in text editor, convert NA to ??, then split columns
df <- read_tsv("genotype_data/chr28.tmp", col_names = FALSE)
## Parsed with column specification:
## cols(
##   .default = col_character()
## )
## See spec(...) for full column specifications.
df <- t(df)
write_tsv(as.data.frame(df), "genotype_data/chr28.fastphase", col_names = FALSE)

#next we remove whitespaces from the genos and label the individuals like so:
# #individual 1 name
# TAGCG...
# TAGCC...
# #individual 2 name
# TAGCG...
# TAGCC...

# this was accomplished by removing all the whitespace in the genos then using the following regex in find and replace
# find: ^(\w+)\t([A-Z?]+)\n^(\w+)\t([A-Z?]+)
# replace: # \1\n\2\n\4

#then we added the header info according to the fastphase manual
# snp pos was taken from the critfc mapping data and manually added to the file

fastphase was run using default setting and attempting to minimize the switch error (Add more details on this for eventual methods)

 ~/Science/programs/fastPHASE  ../genotype_data/chr28.fastphase

haplotype network

Here we will examine the diversity and putative evolutionary history of the haplotypes.

ph_geno <- read_tsv("phasing/phased_genos2_r.txt")

First lets collect some summary info:

ph_geno <- genos_2.2 %>%
  select(Sample, run) %>%
  left_join(ph_geno, by = c("Sample" = "ind"))

ph_geno %>%
  group_by(run) %>% 
  summarise(count = n_distinct(hap))
## `summarise()` ungrouping output (override with `.groups` argument)
#write_tsv(ph_geno, "phasing/phased_genos_run.txt")

There are 2 unique haplotypes in summer sample, 9 in winter, 44 in fall and 73 in halfpounders. There are 84 total unique haplotypes.

We constructed a haplotype network using the minimum spanning approach in Popart v1.7 (also tried to use pegas, but data import and conversion kept unphasing the data or and scrambling the sample ids across the different haplotypes)

Below is the minimum spanning network of the 84 different haplotypes among greb1l region SNPS inferred using fastphase and popart (msn). Log in code chunk below

knitr::include_graphics('phasing/phased_genos2.phylip.svg')

# log

# this network was produced outside of r using popart 1.7. briefly, raw phased genotypes were converted to a phylip format (./phasing/phased_genos2.phylip) and run type was formatted as the popart trait data format (see ./phasing/traits2.txt), then we imported both, set the colors according to the viridis pallete used throughout the notebook, and ran the minimum spanning network algorithm to infer a graph/network
### DID NOT USE THIS ###

# didn't use any of this because row ids and haplotype assignments keep getting scrambled and producing misleading results

phased_loci <- read.loci("phasing/phased_genos_run.txt", header = FALSE, col.pop = 1, col.loci = 2:15)
haps <- haplotype(phased_loci, check.phase = TRUE, )
net <- haploNet(haps)

# structure -> df -> haplotype
phased_loci <- read.structure("phasing/phased_genos_run.stru", n.ind = 882, n.loc = 14, onerowperind = FALSE, col.pop = 1, col.lab = 0, col.others = 0, row.marknames = 0)
#phased_loci <- alleles2loci(phased_loci$tab,  phased = TRUE)
df <- genind2df(phased_loci)
h <- haplotype(as.matrix(df))
hn <- haploNet(h)
plot(hn, fast = FALSE)
#seems like phasing is getting lost in here somewhere because there are only 88 unique haps in the raw data

# this approach (below) works, but it keeps the haplotypes as characters making the distance calculation inappropriate
phased_loci <- read_tsv("phasing/phased_genos_run.txt")
hap <- haplotype(as.matrix(phased_loci[,-1]))
hap <- sort(hap, what = "labels")
dst <- dist.dna(hap)
dst2 <- dist
tree <- rmst(dst)

ind.hap<-with(
    stack(setNames(attr(hap, "index"), row.names(hap))), 
    table(hap=ind, pop=phased_loci$pop)
)

plot(tree, size=attr(tree, "weight"), scale.ratio = 10, labels= FALSE)

plot(tree, size=attr(tree, "weight"), scale.ratio = 100, cex = 0.8, pie=ind.hap, labels = F)
legend(50,50,colnames(ind.hap), col=rainbow(ncol(ind.hap)), pch=20)

#
hn2 <- haploNet(hap)

countHap <- function(hapl = hap, dna = phased_loci){
    with(
        stack(setNames(attr(hapl, "index"), rownames(hapl))),
        table(hapl = ind, pop = dna$pop)
    )
}


countHap()
plot(hn2, pie = countHap(), size = attr(hn2, "freq"), labels= F, scale.ratio = 100 )
legend(50,50,colnames(countHap()), col=rainbow(ncol(countHap())), pch=20)

haplotype plots

Here we make a couple rough plots to see if the haplotypes make sense.

phased_genos <- read_tsv("phasing/phased_genos_run.txt")

#polarize to winter (find the most common allele in winter convert to 1, then convert alternative allele to 0)

#gather the alleles for winter
phased_genos %>%
  filter(run == "Winter") %>%
  count(Chr28_11667578, Omy_GREB1_09, Chr28_11607954, Omy_GREB1_05, Chr28_11625241, Chr28_11632591, Chr28_11658853, `Omy_RAD47080-54`, `Omy_RAD15709-53`, Chr28_11671116, Chr28_11676622, Chr28_11683204, Chr28_11773194, Chr28_11702210) %>%
  slice(which.max(n))
#convert to polarized counts, just did this manually because there are only 14 of them and it was faster (sorry for the hardcoding later David...)
phased_genos <- phased_genos %>%
  mutate(Chr28_11667578 = if_else(Chr28_11667578 == "C", 1, 0)) %>%
  mutate(Omy_GREB1_09 = if_else(Omy_GREB1_09 == "G", 1, 0)) %>%
  mutate(Chr28_11607954 = if_else(Chr28_11607954 == "G", 1, 0)) %>%
  mutate(Omy_GREB1_05 = if_else(Omy_GREB1_05 == "G", 1, 0)) %>%
  mutate(Chr28_11625241 = if_else(Chr28_11625241 == "G", 1, 0)) %>%
  mutate(Chr28_11632591 = if_else(Chr28_11632591 == "G", 1, 0)) %>%
  mutate(Chr28_11658853 = if_else(Chr28_11658853 == "C", 1, 0)) %>%
  mutate(`Omy_RAD47080-54` = if_else(`Omy_RAD47080-54` == "G", 1, 0)) %>%
  mutate(`Omy_RAD15709-53` = if_else(`Omy_RAD15709-53` == "G", 1, 0)) %>%
  mutate(Chr28_11671116 = if_else(Chr28_11671116 == "T", 1, 0)) %>%
  mutate(Chr28_11676622 = if_else(Chr28_11676622 == "G", 1, 0)) %>%
  mutate(Chr28_11683204 = if_else(Chr28_11683204 == "T", 1, 0)) %>%
  mutate(Chr28_11773194 = if_else(Chr28_11773194 == "A", 1, 0)) %>%
  mutate(Chr28_11702210 = if_else(Chr28_11702210 == "G", 1, 0)) 

#lets get rid of the markers that are not diagnostic or heaviliy weighted in the a priori DAPC

#phased_genos <- phased_genos %>%
# select(-one_of(c("Chr28_11607954","Omy_GREB1_09", "Chr28_11632591", "Chr28_11773194")))


winter_pac <- filter(phased_genos, run == "Winter")
summer_pac <- filter(phased_genos, run == "Summer")
fall_pac <- filter(phased_genos, run == "fall")
halfpounder_pac <- filter(phased_genos, run == "halfpounder")

a = pheatmap(winter_pac[,-c(1,2)], cluster_cols = FALSE, show_rownames = FALSE, main = "winter")

b = pheatmap(summer_pac[,-c(1,2)], cluster_cols = FALSE, show_rownames = FALSE, main = "summer")

c = pheatmap(fall_pac[,-c(1,2)], cluster_cols = FALSE, show_rownames = FALSE, main = "fall")

d = pheatmap(halfpounder_pac[,-c(1,2)], cluster_cols = FALSE, show_rownames = FALSE, main = "halfpounder")

#now without the uninformative markers (i.e. only keep diagnostic + greb05)
#skipped this and made better figures in the next code chunk

# 
# phased_genos <- phased_genos %>%
#  select(-one_of(c("Chr28_11607954","Omy_GREB1_09", "Chr28_11632591", "Chr28_11773194")))
# 
# winter_pac <- filter(phased_genos, run == "Winter")
# summer_pac <- filter(phased_genos, run == "Summer")
# fall_pac <- filter(phased_genos, run == "fall")
# halfpounder_pac <- filter(phased_genos, run == "halfpounder")
# 
# #note that b won't run now because there is no variation
# a = pheatmap(winter_pac[,-c(1,2)], cluster_cols = FALSE, show_rownames = FALSE, main = "winter")
# #b = pheatmap(summer_pac[,-c(1,2)], cluster_cols = FALSE, show_rownames = FALSE, main = "summer")
# c = pheatmap(fall_pac[,-c(1,2)], cluster_cols = FALSE, show_rownames = FALSE, main = "fall")
# d = pheatmap(halfpounder_pac[,-c(1,2)], cluster_cols = FALSE, show_rownames = FALSE, main = "halfpounder")

Next let’s make some publication quality figures that more clearly express what is going on here. Goals: separate run types, within fall and half pounders separate into assignments groups, so that we can examine what types of haplotypes contribute to each, also within each plot make sure it is clustered

Plot below shows indiviudal haplotypes (from only informative loci) in the greb1L/rock1 region, haplotypes (rows) are hierarchically clustered, individual snps contributing to haplotypes (columns) are in genomic order, color is polarized so early summer is purple and winter is yellow.

#prepackaged heatmap tools are unlikely to work well given all the weird things we want to do, so lets just build our own ggplot output group by group (winter + summer + 3 each of fall and halfpounder)

#remove uninformative snps
phased_genos <- phased_genos %>%
  select(-one_of(c("Chr28_11607954","Omy_GREB1_09", "Chr28_11632591", "Chr28_11773194")))


#get assignment results

# first get the assignments
ld1_2 <- as.data.frame(dapc_full_apriori$ind.coord) %>%
  rownames_to_column(var="sample") %>%
  left_join(pops_info)

# assignments using original DAPC
CIs <- ld1_2 %>%
  group_by(pop) %>%
  summarise(loCI = quantile(LD1, probs = 0.025),
            hiCI = quantile(LD1, probs = 0.975))

#number of half pounders that fall in the 95% credible interval for winter fish assignment
assn <- ld1_2 %>%
  mutate(assignment = case_when(
    LD1 < CIs$hiCI[4]  ~ "winter", 
    LD1 > CIs$loCI[3] ~ "summer", 
    LD1 <= CIs$loCI[3] & LD1 >= CIs$hiCI[4] ~ "unassigned")) %>%
  select(sample, assignment)


#split the halfpounders and fall into groups by assignment
half_fall_ph <- phased_genos %>%
  filter(run == "halfpounder" | run == "fall") %>%
  left_join(assn)

fall_w <- half_fall_ph %>%
  filter(run == "fall" & assignment == "winter")

fall_s <- half_fall_ph %>%
  filter(run == "fall" & assignment == "summer")

fall_u <- half_fall_ph %>%
  filter(run == "fall" & assignment == "unassigned")

halfpounder_w <- half_fall_ph %>%
  filter(run == "halfpounder" & assignment == "winter")

halfpounder_s <- half_fall_ph %>%
  filter(run == "halfpounder" & assignment == "summer")

halfpounder_u <- half_fall_ph %>%
  filter(run == "halfpounder" & assignment == "unassigned")

winter_pac <- phased_genos %>%
  filter(run == "Winter")

summer_pac <- phased_genos %>%
  filter(run == "Summer")

# get by group clustering results
require(ggdendro)
dendro_winter <- as.dendrogram(hclust(d = dist(winter_pac[,-c(1,2)])))
dendro_summer <- as.dendrogram(hclust(d = dist(summer_pac[,-c(1,2)])))
dendro_fall_w <- as.dendrogram(hclust(d = dist(fall_w[,-c(1,2,13)])))
dendro_fall_s <- as.dendrogram(hclust(d = dist(fall_s[,-c(1,2,13)])))
dendro_fall_u <- as.dendrogram(hclust(d = dist(fall_u[,-c(1,2,13)])))
dendro_halfpounder_w <- as.dendrogram(hclust(d = dist(halfpounder_w[,-c(1,2,13)])))
dendro_halfpounder_s <- as.dendrogram(hclust(d = dist(halfpounder_s[,-c(1,2,13)])))
dendro_halfpounder_u <- as.dendrogram(hclust(d = dist(halfpounder_u[,-c(1,2,13)])))

#add row ids for ordering later
winter_pac <- rowid_to_column(winter_pac, "row_id")
summer_pac <- rowid_to_column(summer_pac, "row_id")
fall_w <- rowid_to_column(fall_w, "row_id")
fall_s <- rowid_to_column(fall_s, "row_id")
fall_u <- rowid_to_column(fall_u, "row_id")
halfpounder_w <- rowid_to_column(halfpounder_w, "row_id")
halfpounder_s <- rowid_to_column(halfpounder_s, "row_id")
halfpounder_u <- rowid_to_column(halfpounder_u, "row_id")

#############
# convert to long format, retain dendrogram order and snp order
##############

long_winter <- winter_pac %>%
  pivot_longer(cols = !(c("run", "sample", "row_id")) ,names_to = "snp", values_to = "allele_count") %>%
  mutate(row_id = factor(row_id, levels=unique(winter_pac$row_id[order.dendrogram(dendro_winter)]), ordered = TRUE)) %>%
  mutate(snp = factor(snp, levels = c("Omy_GREB1_05", "Chr28_11625241", "Chr28_11658853", "Chr28_11667578", "Omy_RAD47080-54", "Omy_RAD15709-53", "Chr28_11671116", "Chr28_11676622", "Chr28_11683204", "Chr28_11702210"), ordered = TRUE))

long_summer <- summer_pac %>%
  pivot_longer(cols = !(c("run", "sample", "row_id")) ,names_to = "snp", values_to = "allele_count") %>%
  mutate(row_id = factor(row_id, levels=unique(summer_pac$row_id[order.dendrogram(dendro_summer)]), ordered = TRUE)) %>%
  mutate(snp = factor(snp, levels = c("Omy_GREB1_05", "Chr28_11625241", "Chr28_11658853", "Chr28_11667578", "Omy_RAD47080-54", "Omy_RAD15709-53", "Chr28_11671116", "Chr28_11676622", "Chr28_11683204", "Chr28_11702210"), ordered = TRUE))

long_fall_w <- fall_w %>%
  pivot_longer(cols = !(c("run", "sample", "row_id", "assignment")) ,names_to = "snp", values_to = "allele_count") %>%
  mutate(row_id = factor(row_id, levels=unique(fall_w$row_id[order.dendrogram(dendro_fall_w)]), ordered = TRUE)) %>%
  mutate(snp = factor(snp, levels = c("Omy_GREB1_05", "Chr28_11625241", "Chr28_11658853", "Chr28_11667578", "Omy_RAD47080-54", "Omy_RAD15709-53", "Chr28_11671116", "Chr28_11676622", "Chr28_11683204", "Chr28_11702210"), ordered = TRUE))

long_fall_s <- fall_s %>%
  pivot_longer(cols = !(c("run", "sample", "row_id", "assignment")) ,names_to = "snp", values_to = "allele_count") %>%
  mutate(row_id = factor(row_id, levels=unique(fall_s$row_id[order.dendrogram(dendro_fall_s)]), ordered = TRUE)) %>%
  mutate(snp = factor(snp, levels = c("Omy_GREB1_05", "Chr28_11625241", "Chr28_11658853", "Chr28_11667578", "Omy_RAD47080-54", "Omy_RAD15709-53", "Chr28_11671116", "Chr28_11676622", "Chr28_11683204", "Chr28_11702210"), ordered = TRUE))

long_fall_u <- fall_u %>%
  pivot_longer(cols = !(c("run", "sample", "row_id", "assignment")) ,names_to = "snp", values_to = "allele_count") %>%
  mutate(row_id = factor(row_id, levels=unique(fall_u$row_id[order.dendrogram(dendro_fall_u)]), ordered = TRUE)) %>%
  mutate(snp = factor(snp, levels = c("Omy_GREB1_05", "Chr28_11625241", "Chr28_11658853", "Chr28_11667578", "Omy_RAD47080-54", "Omy_RAD15709-53", "Chr28_11671116", "Chr28_11676622", "Chr28_11683204", "Chr28_11702210"), ordered = TRUE))

long_half_u <- halfpounder_u %>%
  pivot_longer(cols = !(c("run", "sample", "row_id", "assignment")) ,names_to = "snp", values_to = "allele_count") %>%
  mutate(row_id = factor(row_id, levels=unique(halfpounder_u$row_id[order.dendrogram(dendro_halfpounder_u)]), ordered = TRUE)) %>%
  mutate(snp = factor(snp, levels = c("Omy_GREB1_05", "Chr28_11625241", "Chr28_11658853", "Chr28_11667578", "Omy_RAD47080-54", "Omy_RAD15709-53", "Chr28_11671116", "Chr28_11676622", "Chr28_11683204", "Chr28_11702210"), ordered = TRUE))

long_half_w <- halfpounder_w %>%
  pivot_longer(cols = !(c("run", "sample", "row_id", "assignment")) ,names_to = "snp", values_to = "allele_count") %>%
  mutate(row_id = factor(row_id, levels=unique(halfpounder_w$row_id[order.dendrogram(dendro_halfpounder_w)]), ordered = TRUE)) %>%
  mutate(snp = factor(snp, levels = c("Omy_GREB1_05", "Chr28_11625241", "Chr28_11658853", "Chr28_11667578", "Omy_RAD47080-54", "Omy_RAD15709-53", "Chr28_11671116", "Chr28_11676622", "Chr28_11683204", "Chr28_11702210"), ordered = TRUE))

long_half_s <- halfpounder_s %>%
  pivot_longer(cols = !(c("run", "sample", "row_id", "assignment")) ,names_to = "snp", values_to = "allele_count") %>%
  mutate(row_id = factor(row_id, levels=unique(halfpounder_s$row_id[order.dendrogram(dendro_halfpounder_s)]), ordered = TRUE)) %>%
  mutate(snp = factor(snp, levels = c("Omy_GREB1_05", "Chr28_11625241", "Chr28_11658853", "Chr28_11667578", "Omy_RAD47080-54", "Omy_RAD15709-53", "Chr28_11671116", "Chr28_11676622", "Chr28_11683204", "Chr28_11702210"), ordered = TRUE))

############
#build heatmaps
############

heatmap_w <- ggplot(data = filter(long_winter), aes(x = snp, y = row_id)) +
  geom_tile(aes(fill = allele_count))+scale_fill_viridis_c()+theme_classic()+theme( axis.text.y = element_blank(), axis.ticks.y = element_blank(), axis.line.y = element_blank(), legend.position = "none", axis.text.x = element_blank())+ylab("Winter\n ")+xlab(element_blank())

heatmap_s <- ggplot(data = filter(long_summer), aes(x = snp, y = row_id)) +
  geom_tile(aes(fill = allele_count))+theme_classic()+scale_fill_gradient(low = "#440154FF", high = "#440154FF")+theme( axis.text.y = element_blank(), axis.ticks.y = element_blank(), axis.line.y = element_blank(), legend.position = "none", axis.text.x = element_blank())+ylab("Early-Summer\n ")+xlab(element_blank())

heatmap_fw <- ggplot(data = filter(long_fall_w), aes(x = snp, y = row_id)) +
  geom_tile(aes(fill = allele_count))+scale_fill_viridis_c()+theme_classic()+theme( axis.text.y = element_blank(), axis.ticks.y = element_blank(), axis.line.y = element_blank(), legend.position = "none", axis.text.x = element_blank())+ylab("Late-Summer\nWinter Assigned")+xlab(element_blank())

heatmap_fs <- ggplot(data = filter(long_fall_s), aes(x = snp, y = row_id)) +
  geom_tile(aes(fill = allele_count))+scale_fill_viridis_c()+theme_classic()+theme( axis.text.y = element_blank(), axis.ticks.y = element_blank(), axis.line.y = element_blank(), legend.position = "none", axis.text.x = element_blank())+ylab("Late-Summer\nEarly-Summer ssigned")+xlab(element_blank())

heatmap_fu <- ggplot(data = filter(long_fall_u), aes(x = snp, y = row_id)) +
  geom_tile(aes(fill = allele_count))+scale_fill_viridis_c()+theme_classic()+theme( axis.text.y = element_blank(), axis.ticks.y = element_blank(), axis.line.y = element_blank(), legend.position = "none", axis.text.x = element_blank())+ylab("Late-Summer\nUnassigned")+xlab(element_blank())

heatmap_hw <- ggplot(data = filter(long_half_w), aes(x = snp, y = row_id)) +
  geom_tile(aes(fill = allele_count))+scale_fill_viridis_c()+theme_classic()+theme( axis.text.y = element_blank(), axis.ticks.y = element_blank(), axis.line.y = element_blank(), legend.position = "none", axis.text.x = element_blank())+ylab("Halfpounder\nWinter Assigned")+xlab(element_blank())

heatmap_hs <- ggplot(data = filter(long_half_s), aes(x = snp, y = row_id)) +
  geom_tile(aes(fill = allele_count))+scale_fill_viridis_c()+theme_classic()+theme( axis.text.y = element_blank(), axis.ticks.y = element_blank(), axis.line.y = element_blank(), legend.position = "none", axis.text.x = element_blank())+ylab("Halfpounder\nSummer Assigned")+xlab(element_blank())

heatmap_hu <- ggplot(data = filter(long_half_u), aes(x = snp, y = row_id)) +
  geom_tile(aes(fill = allele_count))+scale_fill_viridis_c()+theme_classic()+theme(axis.text.x = element_text(angle = 90), axis.text.y = element_blank(), axis.ticks.y = element_blank(), axis.line.y = element_blank(), legend.position = "none")+ylab("Halfpounder\nUnassigned")


#make the final plot
require(gridExtra)
grid.arrange(heatmap_s, heatmap_w, heatmap_fs, heatmap_fw, heatmap_fu, heatmap_hs, heatmap_hw, heatmap_hu,  ncol = 1, heights = c(3,3,3,3,3,3,3,7), clip = FALSE)

haplotype network 2

Let’s also build a haplotype network only for the informative markers.

This involves removing the uninformative markers from the .phylip file and running popart again (See above)

Let’s also collect the same stats again

ph_geno <- read_tsv("phasing/phased_genos3_r.txt")

ph_geno <- genos_2.2 %>%
  select(Sample, run) %>%
  left_join(ph_geno, by = c("Sample" = "ind"))

ph_geno %>%
  group_by(run) %>% 
  summarise(count = n_distinct(hap))
#ran popart on phased_genos3.phylip and traits2.txt
knitr::include_graphics('phasing/phased_genos3.phylip.svg')

summary

The density of markers in greb1L/rock1 region in our GTseq dataset allows us to identify characteristic winter and early-summer run haplotypes from our samples. if we focus on only the sites that vary strongly between winter and summer runs, there is a single summer haplotype (note there are multiple haps within this group when considering all sites, not just the diagnostic + heavily weighted ones), and a handful (5?) haplotypes within winter run.

summer and winter assigned halfpounder and fall run fish possess largely winter and summer run haplotypes. unassigned halfpounders and fall run fish also possess 1 copy each of winter and summer run haplotypes, suggesting many unassigned fish are first generation hybrids between winter and summer runs, however, there are also many highly recombined haplotypes among the unassigned halfpounders and fall run fish. evidence of recombination within the greb1L/rock1 region haplotypes suggests gene flow between winter and summer run fish occurs beyond first generation hybrids. however we did not observe any recombined greb1L/rock1 region haplotypes within our sample of winter and summer run fish.

this opens up a question, what spawning events allow for recombination? we propose that there may be partial temporal isolation of early-summer from winter run fish, with fall run adults serving as a bridge between the extreme distribution of run timing. indeed we also discovered a weak but significant association between DAPC DA1 score and arrival timing among fall run fish.

this absence of recombined haplotypes within winter and summer samples may reflect extreme phenotype sampling (adult samples were identified as winter and summer runs by timing of arrival at spawning grounds, taken on far ends of distribution (late winters vs early early-summers))

this also calls into question what phenotype we are looking at, runs are named based on their timing of freshwater entry, but appear to also segregate at least weakly on the basis of spawning time/arrival at spawning grounds.

to continue thinking about differences between halfpounder and fall haplotype results?
where do halfpounders come in here?

Discussion

Two multivariate approaches (unconstrained ordination (PCA) and constrained ordination (DAPC)) as well as a model-based bayesian clustering approach (STRUCTURE) were used to interrogate population genetic structure among 3 runs of adult Rogue River steelhead as well as half-pounders. We also estimated genetic distances among run types using Fst. These approaches attempt to identify structure either (a) among a priori assigned grouping of organisms (in our case run timing), or (b) among de novo identified genetic clusters. We draw inferences on population genetic structure within the Rogue River based on the strong agreement among multivariate and bayesian as well as de novo and a priori approaches to characterizing population genetic structure. However, it is important to note that our results are based on the small portion of total genetic variation represented by our GTseq panel markers, and that these markers do not represent an unbiased representation of genetic variation among Rogue River steelhead. For example, “neutral” markers in our panel were chosen to capture genetic variation at wide spatial scales and not within individual drainages, so our study may underestimate the extent of population differentiation at neutral markers. Similarly, the inclusion of multiple markers associated with run-timing phenotypes allows us to capture strong differentiation at these genomic regions, but these regions are strong outliers across many population genetic parameters, are likely over represented in our dataset relative to neutral markers and generally will overestimate genome-wide genetic differentiation. Most critically, our dataset may exclude genetic variants that drive adaptive phenotypic variation within the Rogue River. As many of our approaches utilize different subsets of markers, and differ in their sensitivity to various biases in the dataset, it is important to consider each in light of its respective caveats and underlying assumptions.

No Difference Among Late-Summer Adults and Half-Pounders
While there is evidence of structure within half pounders and late-summer run steelhead (see discussion below), we did not find any evidence to suggest restricted gene flow among these two groups. Genetic differentiation (Fst) was low at both neutral markers and on average across the full GTseq dataset and there is little distance among group centroids in both the neutral and full PCA. STRUCTURE results suggest that half-pounders and late summer run fish derive similar proportions of their ancestry from different ancestral genetic clusters, regardless of the number of genetic clusters being modeled. Additionally, individuals within each group demonstrate similar variance in this metric. We also failed to discriminate among half-pounder and late summer run fish using a discriminant analysis of principal components of genetic variation in our dataset.

Structure Among Winter and Early-Summer Run Steelhead
In contrast, we identify population genetic structure among winter and early-summer run steelhead at both neutral and adaptive genetic markers. Using markers annotated as neutral in our GTseq dataset, total differentiation is low (Fst ~ 1%), and there is little to no differentiation apparent in the PCA, however, this low level of differentiation is sufficient to clearly discriminate among winter and summer run fish using DAPC.

When using the full dataset, structure among winter and early summer run fish is strong. When 3 or more ancestral genetic clusters are modeled, STRUCTURE estimates alternative genetic clusters that dominate the ancestry of winter and early summer run steelhead, although all individuals demonstrate some degree of admixture with the alternative population’s dominant cluster. Using DAPC, we are able to describe a single axis of genetic variation that strongly discriminate among winter and early-summer run fish. This axis is primarily driven by variation at markers with known associations to run-timing that are diagnostic or nearly diagnostic among winter and early-summer run steelehad, and, to a lesser extent, several neutral annotated markers and markers with known associations to residency vs anadromy.

Structure within Late-Summer and Half-Pounders
The matter of separating late-summer and half-pounder steelhead from early-summer and winter runs is less clear. While all analyses that identified structure within the Rogue River steelhead place half-pounder and late-summer run steelhead as intermediate between winter and early-summer steelhead at the group-level, there is (a) substantial overlap in genetic variation among both half-pounder and fall run steelhead with winter and early summer run steelhead and (b) evidence of population structure within half-pounder and late-summer run steelhead.

Inference of structure within both fall and half-pounder fish comes from several lines of evidence. Among half-pounders, and to a lesser extent late-summer run steelhead, many markers were out of HWE with a significant excess of homozygosity, suggestive of Wahlund effects. These markers were significantly enriched for run-timing associated markers, suggesting that structure within half-pounders and fall run fish is due in part to groups of either early-summer-like or winter-like individuals. Also, de novo identification of population genetic structure via k means clustering always separated late-summer and half-pounder individuals into genetic clusters with both late-summer and winter run adults, regardless of the number of clusters used. Finally, unlike k-means clustering, in STRUCTURE, individuals may derive their ancestry from multiple clusters, and we see that while there are no strong differences at the group level in ancestry, there is a high level of variation within each group.

Indeed, when we attempt to classify both half-pounder and late-summer run fish as winter-like or early-summer like using a linear combination of alleles that strongly discriminate winter and early summer run fish (discriminant axis 1), we observe a group of strongly winter-like fish, a second group of intermeditate fish and a small number of early-summer-like fish. Direct examination of genotypes at known-run timing markers reveals the same pattern. One caveat here that might be worth including in anything published for a wider audience: Methods like the DAPC, and even the association analyses that are the basis of the annotations for our panel, operate under an assumption of additivity. Without a clear, mechaninistic understanding of the g -> p map for run timing, we can not predict run-timing phenotypes from such heterozygous and discordant genotypes at run-timing associated markers. Variation in effect size among loci, dominance, epistasis, the effects of un-genotyped causal loci and plasticity all may be at play here.

add discussion about diagnostic vs highly weighted in DAPC and about difference in phenotypes used for annotation (CRITFC annotations are for interior pops and the phenotype is arrival timing on spawning grounds, whereas prince et al annotations are for freshwater entry) compare our resutls with those from the latest critfc paper and the coastal lineages

also add discussion about how some markers with known annotations are not diagnostic or not even informative, and this could have somehting to do with divergence in the summer haplotype as it spread (assuming a single origin)

In addition to structure within fall and half-pounders along the genetic axis differentiating early-summer and winter runs, we also identified structure within these groups driven by both neutral markers and adaptive markers, including all 5 markers in the dataset with residency-vs-anadromy annotations and a neutral annotated marker with strong residency-vs-anaadromy associations in the Klickitat River.

Taken together, we propose that both late-summer run and half-pounders are a mixed assemblage of individuals demonstrating genetic characteristics inclusive of winter-run, early-summer run and a highly heterozygous intermediate group. Also, despite mark-recapture evidence that early-summer and late-summer run adults are found together on the spawning grounds, genetic differentiation at GTseq panel markers suggests late-summer (and half-pounder) run adults are less differentiated from winter run than early-summer run adults at both neutral and adaptive regions of the genome.

Stray thought / hypothesis collecter

This section is a dumping ground for ideas if we want to move forward with this investigation, and stray thoughts that might warrant returning to in the future. Not edited, remove before sharing widely.

Intermediate freq Ch28
How is it possible to have intermediate MAFs within the Ch28 region at a small subset of markers, when the rest are nearly fixed within the region within winter/early summer runs? These “discordant” genotypes suggests either (a) multiple haplotypes (more than two) segregating within the region, (b) two (or more) separate haplotype blocks with the “RoSA” that are not coinherited (i.e. wildly high recombination in a region that seems to have low recombination in the rest of the species…unlikely), (c) extremely strong balancing selection at the genomic region level (also would be a wild thing to observe) or (d) major genotyping error (Need to go in and check out the allele correction values / genotyping plot for these markers: Chr28_11607954, Chr28_11632591, Ch28_11773194, Omy_GREB1_09)

  • checked out the genotyping plots for these markers, nothing looks out of whack. however, Omy_GREB1_09 is a poorly mapping marker by both SFGL and CRITFC mapping results.

Should also think about whether or not the sequencing data we have might be used to call statistically phased haplotypes for this region (it’s pretty densely genotyped). Conducting an analysis with information with the provided by haplotypes can probably get to the bottom of this, but should think more seriously about what it will provide before investing the time to figure it out.

Is an extremely recent duplication even another possible explanation for this weird pattern that we see.

Neutral structure outliers (strays) in halfpounders
There is evidence that half pounders may return to Rogue despite being native to a different stream, then return to the their natal stream during their subsequent adult spawning run(s). Do we see evidence of any half-pounders from other streams (i.e. outlier fish at neutral loci). This behavior is also predicted as a consequence of the proposed explanation of the half pounder lifestyle as a behavioral adaptation to avoid a recurring region of high SSTs near the Klamath and Rogue basins

Quote from Eversest 1971: Additional evidence of this behavior was noted when returns from “half-pounders” tagged in 1968 were analyzed. Thirty-one fish were tagged in one seine-haul on August 14, 1968, and five were recaptured in the Rogue the sane year. On subsequent migrations in 1969 and 1970, only one of the fish was recaptured in the Rogue while two were taken outside the system, one in the Smith River, and one from the Trinity River, both in northern California.

Evolutionary explanation for maintenance of half-pounders
The results as presented here leave us with something of a problem, if field studies suggest there is limited gene flow among early-summer and winter fish, then (i) why are fall run fish more closely related to winter fish, and (ii) how is the existence of the half pounder or fall run lifestyle maintained. Temporally balancing selection is generally invoked to maintain the existence of multiple life-histories in salmonids, but in this case we may have an example of true balancing selection (heterozygote advantage) as the half-pounder lifehistory (at least at the tiny number of markers considered here) seems to be driven by high heterozygosity at run timing associated markers. The same goes for fall run fish. If return to the rogue in late summer is driven by unfavorable sea surface conditions and a SST “trap,” it would follow that both half pounders and fall run life histories are behavioral adaptations under strong selection, and the alternative life-histories are less fit, but maintained as a short term (non-equilibrium) consequence of balancing selection. In other words, the SST trap leads to strong selection for a fall run and this has different impacts depending on the level of maturity for an individual: for juveniles a fall run results in a false spawingn run (half pounder), for adults, this results in a fall run. Individuals that fail to make the fall run (either as half pounders or mature adults) face selection. This would be a pretty shocking result, but I’m failing to find an alternative explanation. It definitely motivates further study either via a GWAS or a genome wide perspective of demography and selection within the Rogue. I’d be particularly interested in collecting haplotypic data to examine recent demographic histories and evidence of selection. At the very least looking for balancing selection using outlier approaches within fall run fish.

Proportions
do the assignment proportions in half pounder and fall run fish match with the run counts for the appropriately matched year?

pattern within fall
we have dates for approximate freshwater entry for the fall run fish. is there a correlation in the winter-vs-summer genetic axis and date of entry? there is substantial overlap in the entry timing of fish referred to as winter and late-summer run, and substantial interannual variation in the peak of run timing within a group, taken together this suggests that some late late-summer fish could instead be early-arriving winter run fish. this would show up as a correlation between run timing genotype and arrival date with the fall fish. let’s check real quick

#intake files
half_2018_intake <- readxl::read_xlsx("../meta_data/2018_halfpounder/OmyJC18ROGR_STHP Intake form Spread sheet.xlsx", sheet = 3)
half_2019_intake <- readxl::read_xlsx("../meta_data/2019_halfpounder/STHP Intake form Spread sheet 2019.xlsx", sheet = 1)
fall_intake <- readxl::read_xlsx("../meta_data/2019_fall/Rogue Adult Summer and Winter (06 11 20).xlsx", sheet = 1)
summer_intake <- readxl::read_xlsx("../meta_data/2019_summer/Omy Rogue2019 steelhead datasheets.xlsx", sheet = 2)
winter_intake <- readxl::read_xlsx("../meta_data/2019_winter/StW Scales for DNA (06 17 19).xlsx", sheet = 3)

#merge intakes
# first clean them up a bit to make merging easier
half_2018_intake <- half_2018_intake[,c(1,6)]
colnames(half_2018_intake)[1] <- "ID"
half_2018_intake$run <- "halfpounder"
half_2018_intake$year <- "2018"
colnames(half_2018_intake) <- c("ID", "Date", "run", "year")

half_2019_intake <- half_2019_intake[,c(1,2)]
half_2019_intake$run <- "halfpounder"
half_2019_intake$year <- "2019"
colnames(half_2019_intake) <- c("ID", "Date", "run", "year")

summer_intake <- summer_intake[,c(2,3,6)]
colnames(summer_intake) <- c("ID", "Date", "run")
summer_intake$year <- "2019"

winter_intake <- winter_intake[,c(2,3,7)]
colnames(winter_intake) <- c("ID", "Date", "run")
winter_intake$year <- "2019"

fall_intake <- subset(fall_intake, Run == "Summer")
fall_intake <- fall_intake[,c(1,3,6)]
colnames(fall_intake) <- c("ID", "Date", "run")
fall_intake$run <- "fall"
fall_intake$year <- "2019"

meta_data <- bind_rows(half_2018_intake, half_2019_intake, fall_intake, winter_intake, summer_intake)

cor_data <- meta_data %>%
  right_join(ld1_2, by = c("ID" = "sample" ))

ggplot(cor_data[cor_data$run=="fall",])+geom_point(aes(Date,LD1))+theme_classic()+geom_smooth(aes(Date,LD1), method = "lm")

cor_data$jdate <- format(as.POSIXlt(cor_data$Date, format = "%y-%m-%d"), "%j")


summary(lm(LD1~as.numeric(jdate), data=cor_data[cor_data$run=="fall",]))
## 
## Call:
## lm(formula = LD1 ~ as.numeric(jdate), data = cor_data[cor_data$run == 
##     "fall", ])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.17048 -0.72761 -0.04607  0.70425  2.60920 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)  
## (Intercept)        2.534279   1.290038   1.965   0.0513 .
## as.numeric(jdate) -0.010270   0.005098  -2.014   0.0457 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9225 on 155 degrees of freedom
## Multiple R-squared:  0.02551,    Adjusted R-squared:  0.01923 
## F-statistic: 4.058 on 1 and 155 DF,  p-value: 0.04569
summary(lm(LD1~as.numeric(jdate), data=cor_data[cor_data$run=="halfpounder",]))
## 
## Call:
## lm(formula = LD1 ~ as.numeric(jdate), data = cor_data[cor_data$run == 
##     "halfpounder", ])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.3382 -0.8120 -0.2633  0.7326  3.2421 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)  
## (Intercept)        1.766224   0.832324   2.122   0.0342 *
## as.numeric(jdate) -0.007381   0.003324  -2.221   0.0267 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.063 on 641 degrees of freedom
## Multiple R-squared:  0.007635,   Adjusted R-squared:  0.006087 
## F-statistic: 4.932 on 1 and 641 DF,  p-value: 0.02671
ggplot(cor_data[cor_data$run=="halfpounder",])+geom_point(aes(jdate,LD1))+theme_classic()+geom_smooth(aes(jdate,LD1), method = "lm")

There is a significant, but minor trend (r2 = 0.03, p = 0.04) towards more genetically winter-like fish over time among fall run fish.

also look within halfpounders

other

interannual variation

let’s check to make sure there are no interesting patterns / structure among halfpounder years

# create the dataset
half_genind <- seppop(genind_2.0)$halfpounder
years <- str_sub(row.names(half_genind@tab), 6,7)
half_genind$pop <- as.factor(years)

# optimized number of pcs with optim.a.score, smart = F
#half_dapc_year <- dapc(half_genind, n.pca = 200, n.da =1)
#optim.a.score(half_dapc_year, smart = FALSE)
half_dapc_year <- dapc(half_genind, n.pca = 31, n.da =1)
scatter.dapc(half_dapc_year)

Even with DAPC, very little structure year to year in halfpounders.

data export

#merge polarized allele count data with sample site, population year    sample date sample location genotyped based on the 9 loci in bold-face type origin

# first get the assignments
ld1_2 <- as.data.frame(dapc_full_apriori$ind.coord) %>%
  rownames_to_column(var="sample") %>%
  left_join(pops_info)

# assignments using original DAPC
CIs <- ld1_2 %>%
  group_by(pop) %>%
  summarise(loCI = quantile(LD1, probs = 0.025),
            hiCI = quantile(LD1, probs = 0.975))

#number of half pounders that fall in the 95% credible interval for winter fish assignment
assn <- ld1_2 %>%
  mutate(assignment = case_when(
    LD1 < CIs$hiCI[4]  ~ "winter", 
    LD1 > CIs$loCI[3] ~ "summer", 
    LD1 <= CIs$loCI[3] & LD1 >= CIs$hiCI[4] ~ "unassigned")) %>%
  select(sample, assignment)

#get hor/nor
md <- readxl::read_xlsx("./metadata/Omy Rogue2019 steelhead datasheets.xlsx", sheet = 4)

md <- md %>%
  mutate(Origin = case_when(
    Origin == "AD" ~ "HOR",
    Origin == "1" ~ "NOR",
    TRUE ~ Origin
    
  ))

to_exp <- polarized_allele_counts %>%
  rownames_to_column(var = "sample") %>%
  left_join(assn) %>%
  left_join(select(genos_2.2, Sample, Date, run , year), by = c("sample" = "Sample")) %>%
  left_join(select(md, SFGL_ID, Origin), by = c("sample" = "SFGL_ID"))

write_tsv(to_exp, "genotype_data/chr28_polarized_allele_counts_2.txt")